# Full text of "Perceptrons"

Perceptrons: An Introduction to Computational Geometry
Expanded Edition
Marvin L. Minsky and Seymour A. Papert

Third printing, 1988. Copyright 1969 Massachusetts Institute of Technology. Handwritten alterations were made by the authors for the second printing (1972). Preface and epilogue copyright 1988 Marvin Minsky and Seymour Papert. Expanded edition, 1988. All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher. This book was set in Photon Times Roman by The Science Press, Inc., and printed and bound by Halliday Lithograph in the United States of America.

Library of Congress Cataloging-in-Publication Data: Minsky, Marvin Lee, 1927— Perceptrons : An introduction to computational geometry. Bibliography: p. Includes index. 1. Perceptrons. 2. Geometry — Data processing. 3. Parallel processing (Electronic computers) 4. Machine learning. I. Papert, Seymour. II. Title. Q327.M55 1988 006.3 87-30990 ISBN 0-262-63111-3 (pbk.)
CONTENTS

Prologue: A View from 1988
0 Introduction
I Algebraic Theory of Linear Parallel Predicates
1 Theory of Linear Boolean Inequalities
2 Group Invariance of Boolean Inequalities
3 Parity and One-in-a-box Predicates
4 The “And/Or” Theorem
II Geometric Theory of Linear Inequalities
5 $\psi_{\text{CONNECTED}}$: A Geometric Property with Unbounded Order
6 Geometric Patterns of Small Order: Spectra and Context
7 Stratification and Normalization
8 The Diameter-limited Perceptron
9 Geometric Predicates and Serial Algorithms
III Learning Theory
10 Magnitude of the Coefficients
11 Learning
12 Linear Separation and Learning
13 Perceptrons and Pattern Recognition
Epilogue: The New Connectionism
Bibliographic Notes
Index

Prologue: A View from 1988

This book is about perceptrons — the simplest learning machines. However, our deeper purpose is to gain more general insights into the interconnected subjects of parallel computation, pattern recognition, knowledge representation, and learning. It is only because one cannot think productively about such matters without studying specific examples that we focus on theories of perceptrons.

In preparing this edition we were tempted to “bring those theories up to date.” But when we found that little of significance had changed since 1969, when the book was first published, we concluded that it would be more useful to keep the original text (with its corrections of 1972) and add an epilogue, so that the book could still be read in its original form. One reason why progress has been so slow in this field is that researchers unfamiliar with its history have continued to make many of the same mistakes that others have made before them. Some readers may be shocked to hear it said that little of significance has happened in this field.
Have not perceptron-like networks — under the new name connectionism — become a major subject of discussion at gatherings of psychologists and computer scientists? Has not there been a “connectionist revolution?” Certainly yes, in that there is a great deal of interest and discussion. Possibly yes, in the sense that discoveries have been made that may, in time, turn out to be of fundamental importance. But certainly no, in that there has been little clear-cut change in the conceptual basis of the field. The issues that give rise to excitement today seem much the same as those that were responsible for previous rounds of excitement. The issues that were then obscure remain obscure today because no one yet knows how to tell which of the present discoveries are fundamental and which are superficial. Our position remains what it was when we wrote the book: We believe this realm of work to be immensely important and rich, but we expect its growth to require a degree of critical analysis that its more romantic advocates have always been reluctant to pursue — perhaps because the spirit of connectionism seems itself to go somewhat against the grain of analytic rigor.

In the next few pages we will try to portray recent events in the field of parallel-network learning machines as taking place within the historical context of a war between antagonistic tendencies called symbolist and connectionist. Many of the participants in this history see themselves as divided on the question of strategies for thinking — a division that now seems to pervade our culture, engaging not only those interested in building models of mental functions but also writers, educators, therapists, and philosophers.
Too many people too often speak as though the strategies of thought fall naturally into two groups whose attributes seem diametrically opposed in character:

| symbolic | connectionist |
|---|---|
| logical | analogical |
| serial | parallel |
| discrete | continuous |
| localized | distributed |
| hierarchical | heterarchical |
| left-brained | right-brained |

This broad division makes no sense to us, because these attributes are largely independent of one another; for example, the very same system could combine symbolic, analogical, serial, continuous, and localized aspects. Nor do many of those pairs imply clear opposites; at best they merely indicate some possible extremes among some wider range of possibilities. And although many good theories begin by making distinctions, we feel that in subjects as broad as these there is less to be gained from sharpening boundaries than from seeking useful intermediates.

The 1940s: Neural Networks

The 1940s saw the emergence of the simple yet powerful concept that the natural components of mind-like machines were simple abstractions based on the behavior of biological nerve cells, and that such machines could be built by interconnecting such elements. In their 1943 manifesto “A Logical Calculus of the Ideas Immanent in Nervous Activity,” Warren McCulloch and Walter Pitts presented the first sophisticated discussion of “neuro-logical networks,” in which they combined new ideas about finite-state machines, linear threshold decision elements, and logical representations of various forms of behavior and memory. In 1947 they published a second monumental essay, “How We Know Universals,” which described network architectures capable, in principle, of recognizing spatial patterns in a manner invariant under groups of geometric transformations.

From such ideas emerged the intellectual movement called cybernetics, which attempted to combine many concepts from biology, psychology, engineering, and mathematics.
The cybernetics era produced a flood of architectural schemes for making neural networks recognize, track, memorize, and perform many other useful functions. The decade ended with the publication of Donald Hebb’s book The Organization of Behavior, the first attempt to base a large-scale theory of psychology on conjectures about neural networks. The central idea of Hebb’s book was that such networks might learn by constructing internal representations of concepts in the form of what Hebb called “cell-assemblies” — subfamilies of neurons that would learn to support one another’s activities. There had been earlier attempts to explain psychology in terms of “connections” or “associations,” but (perhaps because those connections were merely between symbols or ideas rather than between mechanisms) those theories seemed then too insubstantial to be taken seriously by theorists seeking models for mental mechanisms. Even after Hebb’s proposal, it was years before research in artificial intelligence found suitably convincing ways to make symbolic concepts seem concrete.

The 1950s: Learning in Neural Networks

The cybernetics era opened up the prospect of making mind-like machines. The earliest workers in that field sought specific architectures that could perform specific functions. However, in view of the fact that animals can learn to do many things they aren’t built to do, the goal soon changed to making machines that could learn. Now, the concept of learning is ill defined, because there is no clear-cut boundary between the simplest forms of memory and complex procedures for making predictions and generalizations about things never seen. Most of the early experiments were based on the idea of “reinforcing” actions that had been successful in the past — a concept already popular in behavioristic psychology.
In order for reinforcement to be applied to a system, the system must be capable of generating a sufficient variety of actions from which to choose and the system needs some criterion of relative success. These are also the prerequisites for the “hill-climbing” machines that we discuss in section 11.6 and in the epilogue.

Perhaps the first reinforcement-based network learning system was a machine built by Minsky in 1951. It consisted of forty electronic units interconnected by a network of links, each of which had an adjustable probability of receiving activation signals and then transmitting them to other units. It learned by means of a reinforcement process in which each positive or negative judgment about the machine’s behavior was translated into a small change (of corresponding magnitude and sign) in the probabilities associated with whichever connections had recently transmitted signals. The 1950s saw many other systems that exploited simple forms of learning, and this led to a professional specialty called adaptive control.

Today, people often speak of neural networks as offering a promise of machines that do not need to be programmed. But speaking of those old machines in such a way stands history on its head, since the concept of programming had barely appeared at that time. When modern serial computers finally arrived, it became a great deal easier to experiment with learning schemes and “self-organizing systems.” However, the availability of computers also opened up other avenues of research into learning. Perhaps the most notable example of this was Arthur Samuel’s research on programming computers to learn to play checkers. (See Bibliographic Notes.) Using a success-based reward system, Samuel’s 1959 and 1967 programs attained masterly levels of performance. In developing those procedures, Samuel encountered and described two fundamental questions:

Credit assignment.
Given some existing ingredients, how does one decide how much to credit each of them for each of the machine’s accomplishments? In Samuel’s machine, the weights are assigned by correlation with success.

Inventing novel predicates. If the existing ingredients are inadequate, how does one invent new ones? Samuel’s machine tests products of some preexisting terms.

Most researchers tried to bypass these questions, either by ignoring them or by using brute force or by trying to discover powerful and generally applicable methods. Few researchers tried to use them as guides to thoughtful research. We do not believe that any completely general solution to them can exist, and we argue in our epilogue that awareness of these issues should lead to a model of mind that can accumulate a multiplicity of specialized methods.

By the end of the 1950s, the field of neural-network research had become virtually dormant. In part this was because there had not been many important discoveries for a long time. But it was also partly because important advances in artificial intelligence had been made through the use of new kinds of models based on serial processing of symbolic expressions. New landmarks appeared in the form of working computer programs that solved respectably difficult problems. In the wake of these accomplishments, theories based on connections among symbols suddenly seemed more satisfactory. And although we and some others maintained allegiances to both approaches, intellectual battle lines began to form along such conceptual fronts as parallel versus serial processing, learning versus programming, and emergence versus analytic description.

The 1960s: Connectionists and Symbolists

Interest in connectionist networks revived dramatically in 1962 with the publication of Frank Rosenblatt’s book Principles of Neurodynamics, in which he defined the machines he named perceptrons and proved many theories about them. (See Bibliographic Notes.)
The basic idea was so simply and clearly defined that it was feasible to prove an amazing theorem (theorem 11.1 below) which stated that a perceptron would learn to do anything that it was possible to program it to do. And the connectionists of the 1960s were indeed able to make perceptrons learn to do certain things — but not other things. Usually, when a failure occurred, neither prolonging the training experiments nor building larger machines helped. All perceptrons would fail to learn to do those things, and once again the work in this field stalled.

Arthur Samuel’s two questions can help us see why perceptrons worked as well as they did. First, Rosenblatt’s credit-assignment method turned out to be as effective as any such method could be. When the answer is obtained, in effect, by adding up the contributions of many processes that have no significant interactions among themselves, then the best one can do is reward them in proportion to how much each of them contributed. (Actually, with perceptrons, one never rewards success; one only punishes failure.) And Rosenblatt offered the simplest possible approach to the problem of inventing new parts: You don’t have to invent new parts if enough parts are provided from the start.

Once it became clear that these tactics would work in certain circumstances but not in others, most workers searched for methods that worked in general. However, in our book we turned in a different direction. Instead of trying to find a method that would work in every possible situation, we sought to find ways to understand why the particular method used in the perceptron would succeed in some situations but not in others. It turned out that perceptrons could usually solve the types of problems that we characterize (in section 0.8) as being of low “order.” With those problems, one can indeed sometimes get by with making ingredients at random and then selecting the ones that work.
However, problems of higher “orders” require too many such ingredients for this to be feasible.

This style of analysis was the first to show that there are fundamental limitations on the kinds of patterns that perceptrons can ever learn to recognize. How did the scientists involved with such matters react to this? One popular version is that the publication of our book so discouraged research on learning in network machines that a promising line of research was interrupted. Our version is that progress had already come to a virtual halt because of the lack of adequate basic theories, and the lessons in this book provided the field with new momentum — albeit, paradoxically, by redirecting its immediate concerns. To understand the situation, one must recall that by the mid-1960s there had been a great many experiments with perceptrons, but no one had been able to explain why they were able to learn to recognize certain kinds of patterns and not others. Was this in the nature of the learning procedures? Did it depend on the sequences in which the patterns were presented? Were the machines simply too small in capacity?

What we discovered was that the traditional analysis of learning machines — and of perceptrons in particular — had looked in the wrong direction. Most theorists had tried to focus only on the mathematical structure of what was common to all learning, and the theories to which this had led were too general and too weak to explain which patterns perceptrons could learn to recognize. As our analysis in chapter 2 shows, this actually had nothing to do with learning at all; it had to do with the relationships between the perceptron’s architecture and the characters of the problems that were being presented to it. The trouble appeared when perceptrons had no way to represent the knowledge required for solving certain problems.
The moral was that one simply cannot learn enough by studying learning by itself; one also has to understand the nature of what one wants to learn. This can be expressed as a principle that applies not only to perceptrons but to every sort of learning machine: No machine can learn to recognize X unless it possesses, at least potentially, some scheme for representing X.

The 1970s: Representation of Knowledge

Why have so few discoveries about network machines been made since the work of Rosenblatt? It has sometimes been suggested that the “pessimism” of our book was responsible for the fact that connectionism was in a relative eclipse until recent research broke through the limitations that we had purported to establish. Indeed, this book has been described as having been intended to demonstrate that perceptrons (and all other network machines) are too limited to deserve further attention. Certainly many of the best researchers turned away from network machines for quite some time, but present-day connectionists who regard that as regrettable have failed to understand the place at which they stand in history. As we said earlier, it seems to us that the effect of Perceptrons was not simply to interrupt a healthy line of research. That redirection of concern was no arbitrary diversion; it was a necessary interlude. To make further progress, connectionists would have to take time off and develop adequate ideas about the representation of knowledge. In the epilogue we shall explain why that was a prerequisite for understanding more complex types of network machines.

In any case, the 1970s became the golden age of a new field of research into the representation of knowledge. And it was not only connectionist learning that was placed on hold; it also happened to research on learning in the field of artificial intelligence.
For example, although Patrick Winston’s 1970 doctoral thesis (see “Learning Structural Definitions from Examples,” in The Psychology of Computer Vision, ed. P. H. Winston [McGraw-Hill, 1975]) was clearly a major advance, the next decade of AI research saw surprisingly little attention to that subject.

In several other related fields, many researchers set aside their interest in the study of learning in favor of examining the representation of knowledge in many different contexts and forms. The result was the very rapid development of many new and powerful ideas — among them frames, conceptual dependency, production systems, word-expert parsers, relational databases, K-lines, scripts, nonmonotonic logic, semantic networks, analogy generators, cooperative processes, and planning procedures. These ideas about the analysis of knowledge and its embodiments, in turn, had strong effects not only in the heart of artificial intelligence but also in many areas of psychology, brain science, and applied expert systems.

Consequently, although not all of them recognize this, a good deal of what young researchers do today is based on what was learned about the representation of knowledge since Perceptrons first appeared. As was asserted above, their not knowing that history often leads them to repeat mistakes of the past. For example, many contemporary experimenters assume that, because the perceptron networks discussed in this book are not exactly the same as those in use today, these theorems no longer apply. Yet, as we will show in our epilogue, most of the lessons of the theorems still apply.

The 1980s: The Revival of Learning Machines

What could account for the recent resurgence of interest in network machines? What turned the tide in the battle between the connectionists and the symbolists? Was it that symbolic AI had run out of steam? Was it the important new ideas in connectionism? Was it the prospect of new, massively parallel hardware?
Or did the new interest reflect a cultural turn toward holism?

Whatever the answer, a more important question is: Are there inherent incompatibilities between those connectionist and symbolist views? The answer to that depends on the extent to which one regards each separate connectionist scheme as a self-standing system. If one were to ask whether any particular, homogeneous network could serve as a model for a brain, the answer (we claim) would be, clearly, No. But if we consider each such network as a possible model for a part of a brain, then those two overviews are complementary. This is why we see no reason to choose sides.

We expect a great many new ideas to emerge from the study of symbol-based theories and experiments. And we expect the future of network-based learning machines to be rich beyond imagining. As we say in section 0.3, the solemn experts who complained most about the “exaggerated claims” of the cybernetics enthusiasts were, on balance, in the wrong. It is just as clear to us today as it was 20 years ago that the marvelous abilities of the human brain must emerge from the parallel activity of vast assemblies of interconnected nerve cells. But, as we explain in our epilogue, the marvelous powers of the brain emerge not from any single, uniformly structured connectionist network but from highly evolved arrangements of smaller, specialized networks which are interconnected in very specific ways.

The movement of research interest between the poles of connectionist learning and symbolic reasoning may provide a fascinating subject for the sociology of science, but workers in those fields must understand that these poles are artificial simplifications. It can be most revealing to study neural nets in their purest forms, or to do the same with elegant theories about formal reasoning. Such isolated studies often help in the disentangling of different types of mechanisms, insights, and principles.
But it never makes any sense to choose either of those two views as one’s only model of the mind. Both are partial and manifestly useful views of a reality of which science is still far from a comprehensive understanding.

0 Introduction

0.0 Readers

In writing this we had in mind three kinds of readers. First, there are many new results that will interest specialists concerned with “pattern recognition,” “learning machines,” and “threshold logic.” Second, some people will enjoy reading it as an essay in abstract mathematics; it may appeal especially to those who would like to see geometry return to topology and algebra. We ourselves share both these interests. But we would not have carried the work as far as we have, nor presented it in the way we shall, if it were not for a different, less clearly defined, set of interests.

The goal of this study is to reach a deeper understanding of some concepts we believe are crucial to the general theory of computation. We will study in great detail a class of computations that make decisions by weighing evidence. Certainly, this problem is of great interest in itself, but our real hope is that understanding of its mathematical structure will prepare us eventually to go further into the almost unexplored theory of parallel computers.

The people we want most to speak to are interested in that general theory of computation. We hope this includes psychologists and biologists who would like to know how the brain computes thoughts and how the genetic program computes organisms. We do not pretend to give answers to such questions — nor even to propose that the simple structures we shall use should be taken as “models” for such processes. Our aim — we are not sure whether it is more modest or more ambitious — is to illustrate how such a theory might begin, and what strategies of research could lead to it.

It is for this third class of readers that we have written this introduction.
It may help those who do not have an immediate involvement with it to see that the theory of pattern recognition might be worth studying for other reasons. At the same time we will set out a simplified version of the theory to help readers who have not had the mathematical training that would make the later chapters easy to read. The rest of the book is self-contained and anyone who hates introductions may go directly to Chapter 1.

0.1 Real, Abstract, and Mythological Computers

We know shamefully little about our computers and their computations. This seems paradoxical because, physically and logically, computers are so lucidly transparent in their principles of operation. Yet even a schoolboy can ask questions about them that today’s “computer science” cannot answer. We know very little, for instance, about how much computation a job should require.

As an example, consider one of the most frequently performed computations: solving a set of linear equations. This is important in virtually every kind of scientific work. There are a variety of standard programs for it, which are composed of additions, multiplications, and divisions. One would suppose that such a simple and important subject, long studied by mathematicians, would by now be thoroughly understood. But we ask, How many arithmetic steps are absolutely required? How does this depend on the amount of computer memory? How much time can we save if we have two (or n) identical computers? Every computer scientist “knows” that this computation requires something of the order of $n^3$ multiplications for n equations, but even if this be true no one knows — at this writing — how to begin to prove it. Neither the outsider nor the computation specialist seems to recognize how primitive and how empirical is our present state of understanding of such matters.
We do not know how much the speed of computations can be increased, in general, by using “parallel” as opposed to “serial” — or “analog” as opposed to “digital” — machines. We have no theory of the situations in which “associative” memories will justify their higher cost as compared to “addressed” memories. There is a great deal of folklore about this sort of contrast, but much of this folklore is mere superstition; in the cases we have studied carefully, the common beliefs turn out to be not merely “unproved”; they are often drastically wrong.

The immaturity shown by our inability to answer questions of this kind is exhibited even in the language used to formulate the questions. Word pairs such as “parallel” vs. “serial,” “local” vs. “global,” and “digital” vs. “analog” are used as if they referred to well-defined technical concepts. Even when this is true, the technical meaning varies from user to user and context to context. But usually they are treated so loosely that the species of computing machine defined by them belongs to mythology rather than science.

Now we do not mean to suggest that these are mere pseudo-problems that arise from sloppy use of language. This is not a book of “therapeutic semantics”! For there is much content in these intuitive ideas and distinctions. The problem is how to capture it in a clear, sharp theory.

0.2 Mathematical Strategy

We are not convinced that the time is ripe to attempt a very general theory broad enough to encompass the concepts we have mentioned and others like them. Good theories rarely develop outside the context of a background of well-understood real problems and special cases. Without such a foundation, one gets either the vacuous generality of a theory with more definitions than theorems — or a mathematically elegant theory with no application to reality.
Accordingly, our best course would seem to be to strive for a very thorough understanding of well-chosen particular situations in which these concepts are involved. We have chosen in fact to explore the properties of the simplest machines we could find that have a clear claim to be “parallel” — for they have no loops or feedback paths — yet can perform computations that are nontrivial, both in practical and in mathematical respects.

Before we proceed into details, we would like to reassure nonmathematicians who might be frightened by what they have glimpsed in the pages ahead. The mathematical methods used are rather diverse, but they seldom require advanced knowledge. We explain most of that which goes beyond elementary algebra and geometry. Where this was not practical, we have marked as optional those sections we feel might demand from most readers more mathematical effort than is warranted by the topic’s role in the whole structure. Our theory is more like a tree with many branches than like a narrow high tower of blocks; in many cases one can skip, if trouble is encountered, to the beginning of the following chapter.

The reader of most modern mathematical texts is made to work unduly hard by the authors’ tendency to cover over the intellectual tracks that lead to the discovery of the theorems. We have tried to leave visible the lines of progress. We should have liked to go further and leave traces of all the false tracks we followed; unfortunately there were too many! Nevertheless we have occasionally left an earlier proof even when we later found a “better” one. Our aim is not so much to prove theorems as to give insight into methods and to encourage research. We hope this will be read not as a chain of logical deductions but as a mathematical novel where characters appear, reappear, and develop.

0.3 Cybernetics and Romanticism

The machines we will study are abstract versions of a class of devices known under various names; we have agreed to use the name “perceptron” in recognition of the pioneer work of Frank Rosenblatt. Perceptrons make decisions — determine whether or not an event fits a certain “pattern” — by adding up evidence obtained from many small experiments. This clear and simple concept is important because most, and perhaps all, more complicated machines for making decisions share a little of this character. Until we understand it very thoroughly, we can expect to have trouble with more advanced ideas. In fact, we feel that the critical advances in many branches of science and mathematics began with good formulations of the “linear” systems, and these machines are our candidate for beginning the study of “parallel machines” in general.

Our discussion will include some rather sharp criticisms of earlier work in this area. Perceptrons have been widely publicized as “pattern recognition” or “learning” machines and as such have been discussed in a large number of books, journal articles, and voluminous “reports.” Most of this writing (some exceptions are mentioned in our bibliography) is without scientific value and we will not usually refer by name to the works we criticize. The sciences of computation and cybernetics began, and it seems quite rightly so, with a certain flourish of romanticism. They were laden with attractive and exciting new ideas which have already borne rich fruit. Heavy demands of rigor and caution could have held this development to a much slower pace; only the future could tell which directions were to be the best. We feel, in fact, that the solemn experts who most complained about the “exaggerated claims” of the cybernetic enthusiasts were, in the balance, much more in the wrong.
But now the time has come for maturity, and this requires us to match our speculative enterprise with equally imaginative standards of criticism.

0.4 Parallel Computation

The simplest concept of parallel computation is represented by the diagram in Figure 0.1. The figure shows how one might compute a function $\psi(X)$ in two stages. First we compute independently of one another a set of functions $\varphi_1(X), \varphi_2(X), \ldots, \varphi_n(X)$ and then combine the results by means of a function $\Omega$ of n arguments to obtain the value of $\psi$.

To make the definition meaningful — or, rather, productive — one needs to place some restrictions on the function $\Omega$ and the set $\Phi$ of functions $\varphi_1, \varphi_2, \ldots$. If we do not make restrictions, we do not get a theory: any computation $\psi$ could be represented as a parallel computation in various trivial ways, for example, by making one of the $\varphi$'s be $\psi$ and letting $\Omega$ do nothing but transmit its result. We will consider a variety of restrictions, but first we will give a few concrete examples of the kinds of functions we might want $\psi$ to be.

0.5 Some Geometric Patterns; Predicates

Let R be the ordinary two-dimensional Euclidean plane and let X be a geometric figure drawn on R. X could be a circle, or a pair of circles, or a black-and-white sketch of a face. In general we will think of a figure X as simply a subset of the points of R (that is, the black points).

Let $\psi(X)$ be a function (of figures X on R) that can have but two values. We usually think of the two values of $\psi$ as 0 and 1. But by taking them to be false and true we can think of $\psi(X)$ as a predicate, that is, a variable statement whose truth or falsity depends on the choice of X. We now give a few examples of predicates that will be of particular interest in the sequel.
$\psi_{\text{CIRCLE}}(X) = 1$ if the figure $X$ is a circle, $0$ if the figure is not a circle;
$\psi_{\text{CONVEX}}(X) = 1$ if $X$ is a convex figure, $0$ if $X$ is not a convex figure;
$\psi_{\text{CONNECTED}}(X) = 1$ if $X$ is a connected figure, $0$ otherwise.

We will also use some very much simpler predicates.* The very simplest predicate “recognizes” when a particular single point is in $X$: let $p$ be a point in the plane and define

$\varphi_p(X) = 1$ if $p$ is in $X$, $0$ otherwise.

Finally we will need the kind of predicate that tells when a particular set $A$ is a subset of $X$:

$\varphi_A(X) = 1$ if $A \subset X$, $0$ otherwise.

*We will use “$\varphi$” instead of “$\psi$” for those very simple predicates that will be combined later to make more complicated predicates. No absolute logical distinction is implied.

0.6 One Simple Concept of “Local”

We start by observing an important difference between $\psi_{\text{CONNECTED}}$ and $\psi_{\text{CONVEX}}$. To bring it out we state a fact about convexity:

Definition: A set $X$ fails to be convex if and only if there exist three points such that $q$ is in the line segment joining $p$ and $r$, and

  $p$ is in $X$,
  $q$ is not in $X$,
  $r$ is in $X$.

Thus we can test for convexity by examining triplets of points. If all the triplets pass the test then $X$ is convex; if any triplet fails (that is, meets all conditions above) then $X$ is not convex. Because all the tests can be done independently, and the final decision made by such a logically simple procedure — unanimity of all the tests — we propose this as a first draft of our definition of “local.”

Definition: A predicate $\psi$ is conjunctively local of order $k$ if it can be computed, as in §0.4, by a set $\Phi$ of predicates $\varphi$ such that

  each $\varphi$ depends upon no more than $k$ points of $R$;
  $\psi(X) = 1$ if $\varphi(X) = 1$ for every $\varphi$ in $\Phi$, and $0$ otherwise.

Example: $\psi_{\text{CONVEX}}$ is conjunctively local of order 3.

The property of a figure being connected might not seem at first to be very different in kind from the property of being convex.
Yet we can show that:

Theorem 0.6.1: $\psi_{\text{CONNECTED}}$ is not conjunctively local of any order.

Proof: Suppose that $\psi_{\text{CONNECTED}}$ has order $k$. Then to distinguish between the two figures $X_0$ and $X_1$

[Figures of $X_0$ and $X_1$, with an illegible handwritten annotation, are not reproduced in this scan.]

there must be some $\varphi$ such that $\varphi(X_0) = 0$, because $X_0$ is not connected. All $\varphi$'s have value 1 on $X_1$, which is connected. Now, $\varphi$ can depend on at most $k$ points, so there must be at least one middle square that does not contain one of these points. But then, on the figure $X_2$

[Figure of $X_2$ not reproduced in this scan.]

which is connected (and which, as the argument requires, agrees with $X_0$ on every point that $\varphi$ examines), $\varphi$ must have the same value, 0, that it has on $X_0$. But this cannot be, for all $\varphi$'s must have value 1 on $X_2$.

Of course, if some $\varphi$ is allowed to look at all the points of $R$ then $\psi_{\text{CONNECTED}}$ can be computed, but this would go against any concept of the $\varphi$'s as “local” functions.

0.7 Some Other Concepts of Local

We have accumulated some evidence in favor of “conjunctively local” as a geometrically and computationally meaningful property of predicates. But a closer look raises doubts about whether it is broad enough to lead to a rich enough theory.

Readers acquainted with the mathematical methods of topology will have observed that “conjunctively local” is similar to the notion of “local property” in topology. However, if we were to pursue the analogy, we would restrict the $\varphi$'s to depend upon all the points inside small circles rather than upon fixed numbers of points. Accordingly, we will follow two parallel paths. One is based on restrictions on numbers of points and in this case we shall talk of predicates of limited order. The other is based on restrictions of distances between the points, and here we shall talk of diameter-limited predicates. Despite the analogy with other important situations, the concept of local based on diameter limitations seems to be less interesting in our theory — although one might have expected quite the opposite.
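Returning for a moment to the order-3 test of §0.6, the idea that convexity can be decided by unanimity over triplets of points can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' construction: the retina is assumed to be a small finite grid of integer points, the names `between`, `phi_triplet`, `convex_predicates`, and `psi_convex` are hypothetical, and "q is in the line segment joining p and r" is read as exact collinearity of lattice points. On so coarse a retina the result is a discrete stand-in for Euclidean convexity (two isolated points with no lattice point between them pass the test).

```python
def between(p, q, r):
    """True if lattice point q lies strictly between p and r on the segment pr."""
    (px, py), (qx, qy), (rx, ry) = p, q, r
    if q == p or q == r:
        return False
    # Zero cross product means p, q, r are collinear.
    cross = (qx - px) * (ry - py) - (qy - py) * (rx - px)
    if cross != 0:
        return False
    # Collinear and inside the bounding box of p and r means q is on the segment.
    return min(px, rx) <= qx <= max(px, rx) and min(py, ry) <= qy <= max(py, ry)


def phi_triplet(p, q, r):
    """Stage I predicate: depends on three points; 0 only if the triplet fails."""
    def phi(X):
        return 0 if (p in X and q not in X and r in X) else 1
    return phi


def convex_predicates(R):
    """The family Phi: one predicate for every triplet with q between p and r."""
    return [phi_triplet(p, q, r)
            for p in R for r in R for q in R if between(p, q, r)]


def psi_convex(X, R):
    """Stage II: unanimity. 1 if every phi in Phi has value 1 on X, else 0."""
    X = set(X)
    return int(all(phi(X) == 1 for phi in convex_predicates(R)))


# A hypothetical 4 x 4 retina.
R = [(x, y) for x in range(4) for y in range(4)]
print(psi_convex({(0, 0), (1, 0), (2, 0)}, R))  # a solid horizontal bar
print(psi_convex({(0, 0), (2, 0)}, R))          # a gap at (1, 0)
```

Each predicate looks at exactly three points and the final decision is a plain conjunction, which is all that "conjunctively local of order 3" asks for.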
More serious doubts arise from the narrowness of the “conjunctive” or “unanimity” requirement. As a next step toward extending our concept of local, let us now try to separate essential from arbitrary features of the definition of conjunctive localness. The intention of the definition was to divide the computation of a predicate $\psi$ into two stages:

Stage I: The computation of many properties or features $\varphi_\alpha$ which are each easy to compute, either because each depends only on a small part of the whole input space $R$, or because they are very simple in some other interesting way.

Stage II: A decision algorithm $\Omega$ that defines $\psi$ by combining the results of the Stage I computations. For the division into two stages to be meaningful, this decision function must also be distinctively homogeneous, or easy to program, or easy to compute.

The particular way this intention was realized in our example $\psi_{\text{CONVEX}}$ was rather arbitrary. In Stage I we made sure that the $\varphi$'s were easy to compute by requiring each to depend only upon a few points of $R$. In Stage II we used just about the simplest imaginable decision rule; if the $\varphi$'s are unanimous we accept the figure; we reject it if even a single $\varphi$ disagrees.

We would prefer to be able to present a perfectly precise definition of our intuitive local-vs.-global concept. One trouble is that phrases like “easy-to-compute” keep recurring in our attempt to formulate it. To make this precise would require some scheme for comparing the complexity of different computation procedures. Until we find an intuitively satisfactory scheme for this, and it doesn’t seem to be around the corner, the requirements of both Stage I and Stage II will retain the heuristic character that makes formal definition difficult.

From this point on, we will concentrate our attention on a particular scheme for Stage II — “weighted voting,” or “linear combination” of the predicates of Stage I.
This is the so-called perceptron scheme, and we proceed next to give our final definition.

0.8 Perceptrons

Let $\Phi = \{\varphi_1, \varphi_2, \ldots, \varphi_n\}$ be a family of predicates. We will say that $\psi$ is linear with respect to $\Phi$ if there exists a number $\theta$ and a set of numbers $\{\alpha_{\varphi_1}, \alpha_{\varphi_2}, \ldots, \alpha_{\varphi_n}\}$ such that $\psi(X) = 1$ if and only if

$$\alpha_{\varphi_1}\varphi_1(X) + \cdots + \alpha_{\varphi_n}\varphi_n(X) > \theta.$$

The number $\theta$ is called the threshold and the $\alpha$'s are called the coefficients or weights. (See Figure 0.2.) We usually write more compactly: $\psi(X) = 1$ if and only if

$$\sum_{\varphi \in \Phi} \alpha_\varphi\,\varphi(X) > \theta.$$

[Figure 0.2]

The intuitive idea is that each predicate of $\Phi$ is supposed to provide some evidence about whether $\psi$ is true for any figure $X$. If, on the whole, $\psi(X)$ is strongly correlated with $\varphi(X)$ one expects $\alpha_\varphi$ to be positive, while if the correlation is negative so would be $\alpha_\varphi$. The idea of correlation should not be taken literally here, but only as a suggestive analogy.

Example: Any conjunctively local predicate can be expressed in this form by choosing $\theta = -1$ and $\alpha_\varphi = -1$ for every $\varphi$. For then

$$\sum_{\varphi \in \Phi} (-1)\,\varphi(X) > -1$$

exactly when $\varphi(X) = 0$ for every $\varphi$ in $\Phi$. (The senses of true and false thus have to be reversed for the $\varphi$'s, but this isn’t important.)

[Handwritten note, only partly legible in this scan, giving an alternative way of writing this condition.]

Example: Consider the seesaw of Figure 0.3 and let $X$ be an arrangement of pebbles placed at some of the equally spaced points $\{p_1, \ldots, p_7\}$. Then $R$ has seven points. Define $\varphi_i(X) = 1$ if and only if $X$ contains a pebble at the $i$th point. Then we can express the predicate “The seesaw will tip to the right” by the formula

$$\sum_i (i - 4)\,\varphi_i(X) > 0,$$

where $\theta = 0$ and $\alpha_i = (i - 4)$.

[Figure 0.3]

There are a number of problems concerning the possibility of infinite sums and such matters when we apply this concept to recognizing patterns in the Euclidean plane. These issues are discussed extensively in the text, and we want here only to reassure the mathematician that the problem will be faced.
Except when there is a good technical reason to use infinite sums (and this is sometimes the case) we will make the problem finite by two general methods. One is to treat the retina $R$ as made up of discrete little squares (instead of points) and treat as equivalent figures that intersect the same squares. The other is to consider only bounded $X$'s and choose $\Phi$ so that for any bounded $X$ only a finite number of $\varphi$'s will be nonzero.

Definition: A perceptron is a device capable of computing all predicates which are linear in some given set $\Phi$ of partial predicates.

That is, we are given a set of $\varphi$'s, but can select freely their “weights,” the $\alpha_\varphi$'s, and also the threshold $\theta$. For reasons that will become clear as we proceed, there is little to say about all perceptrons in general. But, by imposing certain conditions and restrictions we will find much to say about certain particularly interesting families of perceptrons. Among these families are

1. Diameter-limited perceptrons: for each $\varphi$ in $\Phi$, the set of points upon which $\varphi$ depends is restricted not to exceed a certain fixed diameter in the plane.

2. Order-restricted perceptrons: we say that a perceptron has order $\le n$ if no member of $\Phi$ depends on more than $n$ points.

3. Gamba perceptrons: each member of $\Phi$ may depend on all the points but must be a “linear threshold function” (that is, each member of $\Phi$ is itself computed by a perceptron of order 1, as defined in 2 above).

4. Random perceptrons: These are the form most extensively studied by Rosenblatt’s group: the $\varphi$'s are random Boolean functions. That is to say, they are order-restricted and $\Phi$ is generated by a stochastic process according to an assigned distribution function.

5. Bounded perceptrons: $\Phi$ contains an infinite number of $\varphi$'s, but all the $\alpha_\varphi$ lie in a finite set of numbers.
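The definition of "linear with respect to $\Phi$" and the seesaw example of §0.8 can be made concrete with a short Python sketch. It is offered as an illustration only: the helper name `linear_predicate` and the representation of a figure $X$ as a Python set of point indices are assumptions made here, not anything specified in the text.

```python
def linear_predicate(weights, theta, phis):
    """Build psi with psi(X) = 1 iff sum_i weights[i] * phis[i](X) > theta."""
    def psi(X):
        total = sum(a * phi(X) for a, phi in zip(weights, phis))
        return int(total > theta)
    return psi


# Seesaw with seven equally spaced points p_1 .. p_7.
points = range(1, 8)

# phi_i(X) = 1 iff X contains a pebble at the i-th point.
phis = [lambda X, i=i: int(i in X) for i in points]

# Coefficients alpha_i = i - 4 and threshold theta = 0, as in the example.
weights = [i - 4 for i in points]
tips_right = linear_predicate(weights, 0, phis)

print(tips_right({7}))     # one pebble at the far right end
print(tips_right({1, 7}))  # balanced: the sum is 0, which is not > 0
```

The partial predicates are fixed once and for all; only the list of weights and the threshold would change if a different predicate in the same family were wanted, which is the sense in which the definition of a perceptron leaves the $\alpha_\varphi$'s and $\theta$ free.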
To give a preview of the kind of results we will obtain, we present here a simple example of a theorem about diameter-restricted perceptrons.

Theorem 0.8: No diameter-limited perceptron can determine whether or not all the parts of any geometric figure are connected to one another! That is, no such perceptron computes $\psi_{\text{CONNECTED}}$.

The proof requires us to consider just four figures

[Figures $X_{00}$, $X_{01}$, $X_{10}$, $X_{11}$ not reproduced in this scan.]

and a diameter-limited perceptron $\psi$ whose support sets have diameters like those indicated by the circles below:

[Figure showing the support-set circles not reproduced in this scan.]

It is understood that the diameter in question is given at the start, and we then choose the $X_{ij}$'s to be several diameters in length. Suppose that such a perceptron could distinguish disconnected figures (like $X_{00}$ and $X_{11}$) from connected figures (like $X_{10}$ and $X_{01}$), according to whether or not

$$\sum_{\varphi} \alpha_\varphi\,\varphi > \theta,$$

that is, according to whether or not

$$\sum_{\text{group 1}} \alpha_\varphi\,\varphi(X) + \sum_{\text{group 2}} \alpha_\varphi\,\varphi(X) + \sum_{\text{group 3}} \alpha_\varphi\,\varphi(X) > \theta,$$

where we have grouped the $\varphi$'s according to whether their support sets lie near the left, right, or neither end of the figures. Then for $X_{00}$ the total sum, measured relative to the threshold $\theta$, must be negative. In changing to $X_{10}$ only $\sum_{\text{group 1}}$ is affected, and its value must increase enough to make the total sum become positive. If we were instead to change $X_{00}$ to $X_{01}$ then $\sum_{\text{group 2}}$ would have to increase. But if we were to change $X_{00}$ to $X_{11}$, both $\sum_{\text{group 1}}$ and $\sum_{\text{group 2}}$ will have to increase by these same amounts since (locally!) the same changes are seen by the group 1 and group 2 predicates, while $\sum_{\text{group 3}}$ is unchanged in every case. Hence, net change in the $X_{00} \to X_{11}$ case must be even more positive, so that if the perceptron is to make the correct decision for $X_{00}$, $X_{01}$, and $X_{10}$, it is forced to accept $X_{11}$ as connected, and this is an error! So no such perceptron can exist.
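The bookkeeping in the proof of Theorem 0.8 can be written out as a short chain of inequalities. The symbols $S_i$ and $\Delta_i$ are introduced here purely for illustration and do not appear in the text; the only facts used are those stated in the proof, namely that each change of figure is seen by one group of predicates alone and that group 3 sees no change at all.

```latex
% Requires amsmath. Notation assumed for this sketch (not the authors'):
%   S_i(X) = \sum_{\varphi \in \text{group } i} \alpha_\varphi \varphi(X),
%   S(X)   = S_1(X) + S_2(X) + S_3(X).
\begin{align*}
\Delta_1 &= S_1(X_{10}) - S_1(X_{00}), &
\Delta_2 &= S_2(X_{01}) - S_2(X_{00}).
\intertext{Correct decisions on $X_{00}$ (disconnected) and on $X_{10}$, $X_{01}$ (connected) require}
S(X_{00}) &\le \theta, \\
S(X_{10}) &= S(X_{00}) + \Delta_1 > \theta, \\
S(X_{01}) &= S(X_{00}) + \Delta_2 > \theta,
\intertext{so $\Delta_1 > 0$ and $\Delta_2 > 0$. Because the group 1 and group 2 predicates
see in $X_{11}$ exactly the local changes they saw in $X_{10}$ and $X_{01}$,}
S(X_{11}) &= S(X_{00}) + \Delta_1 + \Delta_2 > \theta + \Delta_2 > \theta .
\end{align*}
```

The last line says that the perceptron would output 1 on $X_{11}$, which is disconnected, and that is the contradiction the proof relies on.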
Readers already familiar with perceptrons will note that this proof — which shows that diameter-limited perceptrons cannot recognize connectedness — is concerned neither with “learning” nor with probability theory (or even with the geometry of hyperplanes in $n$-dimensional hyperspace). It is entirely a matter of relating the geometry of the patterns to the algebra of weighted predicates. Readers concerned with physiology will note that — insofar as the presently identified functions of receptor cells are all diameter-limited — this suggests that an animal will require more than neurosynaptic “summation” effects to make these cells compute connectedness. Indeed, only the most advanced animals can apprehend this complicated visual concept. In Chapter 5 this theorem is shown to extend also to order-limited perceptrons.

0.9 Seductive Aspects of Perceptrons

The purest vision of the perceptron as a pattern-recognizing device is the following:

The machine is built with a fixed set of computing elements for the partial functions $\varphi$, usually obtained by a random process. To make it recognize a particular pattern (set of input figures) one merely has to set the coefficients to suitable values. Thus “programming” takes on a pleasingly homogeneous form. Moreover since “programs” are representable as points $(\alpha_1, \ldots, \alpha_n)$ in an $n$-dimensional space, they inherit a metric which makes it easy to imagine a kind of automatic programming which people have been tempted to call learning: by attaching feedback devices to the parameter controls they propose to “program” the machine by providing it with a sequence of input patterns and an “error signal” which will cause the coefficients to change in the right direction when the machine makes an inappropriate decision. The perceptron convergence theorems (see Chapter 11) define conditions under which this procedure is guaranteed to find, eventually, a correct set of values.
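The "error signal" procedure described above can be sketched as follows. The passage does not spell out an update rule, so the one used here is an assumption: the familiar error-correction rule that adds the feature vector to the weights after a missed positive and subtracts it after a false alarm, with the threshold adjusted in the opposite sense. The function name `train_perceptron`, the three-point retina, and the target predicate ("X contains at least two points") are all hypothetical choices made for this illustration.

```python
from itertools import combinations


def train_perceptron(phis, examples, max_epochs=1000):
    """Error-correction training of the weights and the threshold.

    examples is a list of (X, target) pairs with target in {0, 1}.
    Returns (weights, theta) after the first pass with no errors.
    """
    weights = [0] * len(phis)
    theta = 0
    for _ in range(max_epochs):
        errors = 0
        for X, target in examples:
            features = [phi(X) for phi in phis]
            total = sum(a * f for a, f in zip(weights, features))
            output = int(total > theta)
            if output != target:
                # Move the coefficients toward the desired answer.
                sign = 1 if target == 1 else -1
                weights = [a + sign * f for a, f in zip(weights, features)]
                theta -= sign
                errors += 1
        if errors == 0:
            return weights, theta
    raise RuntimeError("no error-free pass within max_epochs")


# A hypothetical three-point retina with one point predicate per point.
R = [0, 1, 2]
phis = [lambda X, p=p: int(p in X) for p in R]

# All eight figures, labelled by the target "X contains at least two points".
figures = [frozenset(c) for k in range(len(R) + 1) for c in combinations(R, k)]
examples = [(X, int(len(X) >= 2)) for X in figures]

weights, theta = train_perceptron(phis, examples)
print(weights, theta)
```

The target chosen here is linear in the given $\varphi$'s (weights 1, 1, 1 with threshold 1.5 would do), so the procedure stops; for a target outside the repertoire the loop would simply run until `max_epochs` and raise.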
0.9.1 Homogeneous Programming and Learning

To separate reality from wishful thinking, we begin by making a number of observations. Let $\Phi$ be the set of partial predicates of a perceptron and $L(\Phi)$ the set of all predicates linear in $\Phi$. Thus $L(\Phi)$ is the repertoire of the perceptron — the set of predicates it can compute when its coefficients $\alpha_\varphi$ and threshold $\theta$ range over all possible values. Of course $L(\Phi)$ could in principle be the set of all predicates but this is impossible in practice, since $\Phi$ would have to be astronomically large. So any physically real perceptron has a limited repertoire. The ease and uniformity of programming have been bought at a cost! We contend that the traditional investigations of perceptrons did not realistically measure this cost. In particular they neglect the following crucial points:

1. The idea of thinking of classes of geometrical objects (or programs that define or recognize them) as classes of $n$-dimensional vectors $(\alpha_1, \ldots, \alpha_n)$ loses the geometric individuality of the patterns and leads only to a theory that can do little more than count the number of predicates in $L(\Phi)$! This kind of imagery has become traditional among those who think about pattern recognition along lines suggested by classical statistical theories. As a result not many people seem to have observed or suspected that there might be particular meaningful and intuitively simple predicates that belong to no practically realizable set $L(\Phi)$. We will extend our analysis of $\psi_{\text{CONNECTED}}$ to show how deep this problem can be. At the same time we will show that certain predicates which might intuitively seem to be difficult for these devices can, in fact, be recognized by low-order perceptrons: $\psi_{\text{CONVEX}}$ already illustrates this possibility.

2. Little attention has been paid to the size, or more precisely, the information content, of the parameters $\alpha_1, \ldots, \alpha_n$.
We will give examples (which we believe are typical rather than exceptional) where the ratio of the largest to the smallest of the coefficients is meaninglessly big. Under such conditions it is of no (practical) avail that a predicate be in $L(\Phi)$. In some cases the information capacity needed to store $\alpha_1, \ldots, \alpha_n$ is even greater than that needed to store the whole class of figures defined by the pattern!

3. Closely related to the previous point is the problem of time of convergence in a “learning” process. Practical perceptrons are essentially finite-state devices (as shown in Chapter 11). It is therefore vacuous to cite a “perceptron convergence theorem” as assurance that a learning process will eventually find a correct setting of its parameters (if one exists). For it could do so trivially by cycling through all its states, that is, by trying all coefficient assignments. The significant question is how fast the perceptron learns relative to the time taken by a completely random procedure, or a completely exhaustive procedure. It will be seen that there are situations of some geometric interest for which the convergence time can be shown to increase even faster than exponentially with the size of the set $R$.

Perceptron theorists are not alone in neglecting these precautions. A perusal of any typical collection of papers on “self-organizing” systems will provide a generous sample of discussions of “learning” or “adaptive” machines that lack even the degree of rigor and formal definition to be found in the literature on perceptrons. The proponents of these schemes seldom provide any analysis of the range of behavior which can be learned nor do they show much awareness of the price usually paid to make some kinds of learning easy: they unwittingly restrict the device’s total range of behavior with hidden assumptions about the environment in which it is to operate.
These critical remarks must not be read as suggestions that we are opposed to making machines that can “learn.” Exactly the contrary! But we do believe that significant learning at a significant rate presupposes some significant prior structure. Simple learning schemes based on adjusting coefficients can indeed be practical and valuable when the partial functions are reasonably matched to the task, as they are in Samuel’s checker player. A perceptron whose $\varphi$'s are properly designed for a discrimination known to be of suitably low order will have a good chance to improve its performance adaptively. Our purpose is to explain why there is little chance of much good coming from giving a high-order problem to a quasi-universal perceptron whose partial functions have not been chosen with any particular task in mind.

It may be argued that people are universal learning machines and so a counterexample to this thesis. But our brains are sufficiently structured to be programmable in a much more general sense than the perceptron and our culture is sufficiently structured to provide, if not actual program, at least a rather complex set of interactions that govern the course of whatever the process of self-programming may be. Moreover, it takes time for us to become universal learners: the sequence of transitions from infancy to intellectual maturity seems rather a confirmation of the thesis that the rate of acquisition of new cognitive structure (that is, learning) is a sensitive function of the level of existing cognitive structure.

0.9.2 Parallel Computation

The perceptron was conceived as a parallel-operation device in the physical sense that the partial predicates are computed simultaneously. (From a formal point of view the important aspect is that they are computed independently of one another.)
The price paid for this is that all the $\varphi$'s must be computed, although only a minute fraction of them may in fact be relevant to any particular final decision. The total amount of computation may become vastly greater than that which would have to be carried out in a well planned sequential process (using the same $\varphi$'s) whose decisions about what next to compute are conditional on the outcome of earlier computation. Thus the choice between parallel and serial methods in any particular situation must be based on balancing the increased value of reducing the (total elapsed) time against the cost of the additional computation involved.

Even low-order predicates may require large amounts of wasteful computation of information which would be irrelevant to a serial process. This cost may sometimes remain within physically realizable bounds, especially if a large tolerance (or “blur”) is acceptable. High-order predicates usually create a completely different situation. An instructive example is provided by $\psi_{\text{CONNECTED}}$. As shown in Chapter 5, any perceptron for this predicate on a 100 × 100 toroidal retina needs partial functions that each look at many hundreds of points! In this case the concept of “local” function is almost irrelevant: the partial functions are themselves global. Moreover, the fantastic number of possible partial functions with such large supports sheds gloom on any hope that a modestly sized, randomly generated set of them would be sufficiently dense to span the appropriate space of functions. To make this point sharper we shall show that for certain predicates and classes of partial functions, the number of partial functions that have to be used (to say nothing of the size of their coefficients) would exceed physically realizable limits.
The conclusion to be drawn is that the appraisal of any particular scheme of parallel computation cannot be undertaken rationally without tools to determine the extent to which the problems to be solved can be analyzed into local and global components. The lack of a general theory of what is global and what is local is no excuse for avoiding the problem in particular cases. This study will show that it is not impossibly difficult to develop such a theory for a limited but important class of problems.

0.9.3 The Use of Simple Analogue Devices

Part of the attraction of the perceptron lies in the possibility of using very simple physical devices — “analogue computers” — to evaluate the linear threshold functions. It is perhaps generally appreciated that the utility of this scheme is limited by the sparseness of linear threshold functions in the set of all logical functions. However, almost no attention has been paid to the possibility that the set of linear functions which are practically realizable may be rarer still. To illustrate this problem we shall compute (in Chapter 10) the range and sizes of the coefficients in the linear representations of certain predicates. It will be seen that certain ratios can increase faster than exponentially with the number of distinguishable points in $R$. It follows that for “big” input sets — say, $R$'s with more than 20 points — no simple analogue storage device can be made with enough information capacity to store the whole range of coefficients!

To avoid misunderstanding perhaps we should repeat the qualifications we made in connection with our critique of the perceptron as a model for “learning devices.” We have no doubt that analogue devices of this sort have a role to play in pattern recognition. But we do not see that any good can come of experiments which pay no attention to limiting factors that will assert themselves as soon as the small model is scaled up to a usable size.
0.9.4 Models for Brain Function and Gestalt Psychology

The popularity of the perceptron as a model for an intelligent, general-purpose learning machine has roots, we think, in an image of the brain itself as a rather loosely organized, randomly interconnected network of relatively simple devices. This impression in turn derives in part from our first impressions of the bewildering structures seen in the microscopic anatomy of the brain (and probably also derives from our still-chaotic ideas about psychological mechanisms).

In any case the image is that of a network of relatively simple elements, randomly connected to one another, with provision for making adjustments of the ease with which signals can go across the connections. When the machine does something bad, we will “teach” it not to do it again by weakening the connections that were involved; perhaps we will do the opposite to reward it when it does something we like.

The “perceptron” type of machine is one particularly simple version of this broader concept; several others have also been studied in experiments.

The mystique surrounding such machines is based in part on the idea that when such a machine learns the information stored is not localized in any particular spot but is, instead, “distributed throughout” the structure of the machine’s network. It was a great disappointment, in the first half of the twentieth century, that experiments did not support nineteenth century concepts of the localization of memories (or most other “faculties”) in highly local brain areas. Whatever the precise interpretation of those not particularly conclusive experiments should be, there is no question but that they did lead to a search for nonlocal machine-function concepts. This search was not notably successful. Several schemes were proposed, based upon large-scale fields, or upon “interference patterns” in global oscillatory waves, but these never led to plausible theories.
(Toward the end of that era a more intricate and substantially less global concept of “cell-assembly” — proposed by D. O. Hebb [1949] — lent itself to more productive theorizing; though it has not yet led to any conclusive model, its popularity is today very much on the increase.) However, it is not our goal here to evaluate these theories, but only to sketch a picture of the intellectual stage that was set for the perceptron concept. In this setting, Rosenblatt’s [1958] schemes quickly took root, and soon there were perhaps as many as a hundred groups, large and small, experimenting with the model either as a “learning machine” or in the guise of “adaptive” or “self-organizing” networks or “automatic control” systems.

The results of these hundreds of projects and experiments were generally disappointing, and the explanations inconclusive. The machines usually work quite well on very simple problems but deteriorate very rapidly as the tasks assigned to them get harder. The situation isn’t usually improved much by increasing the size and running time of the system. It was our suspicion that even in those instances where some success was apparent, it was usually due more to some relatively small part of the network, and not really to a global, distributed activity. Both of the present authors (first independently and later together) became involved with a somewhat therapeutic compulsion: to dispel what we feared to be the first shadows of a “holistic” or “Gestalt” misconception that would threaten to haunt the fields of engineering and artificial intelligence as it had earlier haunted biology and psychology. For this, and for a variety of more practical and theoretical goals, we set out to find something about the range and limitations of perceptrons.

It was only later, as the theory developed, that we realized that understanding this kind of machine was important whether or not the system has practical applications in particular situations!
For the same kinds of problems were becoming serious obstacles to the progress of computer science itself. As we have already remarked, we do not know enough about what makes some algorithmic procedures “essentially” serial, and to what extent — or rather, at what cost — can computations be speeded up by using multiple, overlapping computations on larger more active memories.

0.10 General Plan of the Book

The theory divides naturally into three parts. In Part I we explore some very general properties of linear predicate families. The theorems in Part I apply usually to all perceptrons, independently of the kinds of patterns considered; therefore the theory has the quality of algebra rather than geometry. In Part II we look more narrowly at interesting geometric patterns, and get sharper but, of course, less general, theorems about the geometric abilities of our machines. In Part III we examine a variety of questions centered around the potentialities of perceptrons as practical devices for pattern recognition and learning. The final chapter traces some of the history of these ideas and proposes some plausible directions for further exploration.

[Handwritten authors' note added for the second printing; illegible in this scan.]

I ALGEBRAIC THEORY OF LINEAR PARALLEL PREDICATES

Introduction to Part I

Part I (Chapters 1-4) contains a series of purely algebraic definitions and general theorems used later in Part II. It will be easier to read through this material if one has already a preliminary picture of the roles these mathematical devices are destined to play. We can give such a picture by outlining how we will prove the following theorem:

[Handwritten authors' note, only partly legible in this scan: “We do not expect the reader ... absorb ... condensed ... synopsis.”]
Theorem 3.1 (Chapter 3) Informal Version: Suppose the retina $R$ has a finite number of points. Then there is no perceptron $\sum \alpha_\varphi\,\varphi(X) > \theta$ that can decide whether or not the “number of points in $X$ is odd” unless at least one of the $\varphi$'s depends on all the points of $R$.

Thus no bound can be placed on the orders of perceptrons that compute this predicate for arbitrarily large retinas. To realize it a perceptron has to have at the start at least one $\varphi$ that looks at the whole picture! The proof uses several steps:

Step 1: In §1.1-§1.4, we define “perceptron,” “order,” etc., more precisely, and show that certain details of the definitions can be changed without serious effects.

Step 2: In §1.3 we define the particularly simple functions called “masks.” For each subset $A$ of the retina define the mask $\varphi_A(X)$ to have value 1 if the figure $X$ contains or “covers” all of $A$, value 0 otherwise. Then we prove the simple but important theorem (§1.5) that if a predicate has order $\le k$ (see §1.3) in any set $\Phi$ of $\varphi$ functions, then there is an equivalent perceptron that uses only masks of size $\le k$.

Step 3: To get at the parity — the “odd-even” property — we ask: What rearrangements of the input space $R$ leave it unaffected? That is, we ask about the group of transformations of the figure that have no effect on the property. This might seem to be an exotic way to approach the problem, but since it seems necessary for the more difficult problems we attack later, it is good first to get used to it in this simple situation. In this case, the group is the whole permutation group on $R$ — the set of all rearrangements of its points.

Step 4: In Chapter 2 we show how to use this group to reduce the perceptron to a simple form. The group-invariance theorem (proved in §2.2) is used to show that, for the parity perceptron, all masks with the same support size — that is, all those that look at the same number of points — can be given identical coefficients.
Let $\beta_j$ be the weight assigned to all masks that have support size $= j$.

[Figure: Group-invariant coefficients for the $|R| = 3$ parity predicate.]

Step 5: It is then shown (in §3.1) that the parity perceptron can be written in the form

$$\sum_{j=0}^{k} \beta_j \binom{|X|}{j} > 0,$$

where $|X|$ is the number of points in $X$, $k$ is the largest support size, and $\binom{|X|}{j}$ is the number of subsets of $X$ that have $j$ elements.

[Handwritten authors' note, largely illegible in this scan; it ends with a pointer to Chapter 10.]

Step 6: Because

$$\binom{n}{j} = \frac{n!}{j!\,(n-j)!} = \frac{1}{j!}\,(n + 1 - 1)(n + 1 - 2)\cdots(n + 1 - j)$$

is a product of $j$ linear terms, it is a polynomial of degree $j$ in $n$. Therefore we can write our predicate in the form

$$P_k(|X|) > 0,$$

where $P_k$ is a polynomial in $|X|$ of algebraic degree $\le k$. Now if $|X|$ is an odd number, $P_k(|X|) > 0$ while if $|X|$ is even, $P_k(|X|) \le 0$. Therefore, in the range $0 \le |X| \le |R|$, $P_k$ must change its direction $|R| - 1$ times. But a polynomial must have degree $\ge |R|$ to do that, so we conclude that $k \ge |R|$. This completes the proof exactly as it will be done in §3.1.

This shows how the algebra works into our theory. For some of the more difficult connectedness theorems of Chapter 5, we need somewhat more algebra and group theory. In Chapter 4 we push the ideas about the geometry of algebraic degrees a little further to show that some surprisingly simple predicates require unbounded-order perceptrons. But the results of Chapter 4 are not really used later, and the chapter could be skipped on first reading. To see some simpler, but still characteristic results, the reader might turn directly to Chapter 8, which is almost self-contained because it does not need the algebraic theory.

1 Theory of Linear Boolean Inequalities

1.0 In this chapter we introduce the theory of the linear representation of predicates.
We will talk about properties of functions defined on an abstract set of points, without any additional mathematical or geometrical structure. Thus this chapter can be regarded as an extension of ordinary Boolean algebra. Later, the theorems proved here will be applied to sets with particular geometrical or topological structures, such as the ordinary Euclidean plane. So, we begin by talking about sets in general; only later will we deal with properties of familiar things like “triangles.”

We shall begin with predicates defined for a fixed base space $R$. In §1.1-§1.5, whenever we speak of a predicate we assume an $R$ is already chosen. Later on, we will be interested in “predicates” defined more broadly, either entirely independent of any base space, or on any one of some large family of spaces. For example, the predicate “The set $X$ is nonempty” can be applied to any space $R$. The predicate “The set $X$ is connected” is meaningful when applied to many different spaces in which there is a concept of points being near one another. In §1.6 we will introduce the term “predicate scheme” for this more general sense of “predicate.” Our main goal is to define the general notion of order of a predicate (§1.5) and the notion of finite order of a predicate-scheme (§1.6). In later chapters we will use the term predicate loosely to refer also to predicate-schemes, and in §1.7 there are some remarks on making these definitions more precise and formal. But we do not recommend readers to worry about this until after the main results are understood intuitively.

1.1 Definitions

The letter $R$ will denote an arbitrary set of points. We will usually use the lower case letters $a, b, c, \ldots, x, y, z$ for individual points of $R$ and the upper case $A, B, C, \ldots, X, Y, Z$ for subsets of $R$. Usually “$x$” and “$X$” will be used for “variable” points and subsets. We will often be interested in particular “families” of subsets, and will use small upper-case words for them.
Thus CIRCLE is the set of subsets of $R$ that form complete circles (as in §0.5). For an abstract family of subsets we will use the bold-face $\mathbf{F}$.

It is natural to associate with any family $\mathbf{F}$ of sets a predicate $\psi_{\mathbf{F}}(X)$ which is true if and only if $X$ is in the family $\mathbf{F}$. For example $\psi_{\text{CONVEX}}(X)$ is true or false according to whether $X$ is or is not a convex set. Of course, $\psi_{\text{CIRCLE}}$ and $\psi_{\text{CONVEX}}$ are meaningless except on nonabstract $R$'s to which these geometric ideas can be applied. The Greek letters $\varphi$ and $\psi$ will always represent predicates. $\psi$ will usually denote the predicate of main interest, while $\varphi$ predicates are usually in a large family of easily computed functions; the symbol $\Phi$ will refer to that family.

A predicate is a function (of subsets of $R$) that has two possible values. Sometimes we think of these two values as “true” and “false”; other times it is very useful to think of them as “1” and “0.” Because there is occasionally some danger of confusing these two kinds of predicate values, we have introduced the notation $\lceil \psi(X) \rceil$ to avoid ambiguity. The corners always mean that the “1” and “0” values are to be used. This makes it possible to use the values of predicates as ordinary numbers, and this is important in our theory since, as discussed in Chapter 0, we have to combine evidence obtained from predicates. Any mathematical statement can be used inside the corners: for example, since 3 is less than 5, and 1 is less than 2, we can write

$$\lceil 3 < 5 \rceil = 1, \qquad \lceil 3 < 5 \rceil + \lceil 1 < 2 \rceil = 2, \qquad \lceil 3 < 5 \rceil + \lceil 5 < 3 \rceil = 1,$$

or even

$$\lceil 3 < \lceil 5 = 1 \rceil \rceil = 0, \qquad 4 \cdot \lceil 3 < 5 \rceil + 2 \cdot \lceil 6 < 2 \rceil = 4.$$

It will sometimes be convenient to think of the points of $R$ as enumerated in a sequence $x_1, x_2, x_3, \ldots, x_i, \ldots$ Then many predicates can be expressed in terms of the traditional representations of Boolean algebra.
For example the two expressions

$$x_i \lor x_j \qquad \text{and} \qquad \lceil x_i \in X \ \text{OR}\ x_j \in X \rceil$$

have the same meaning, namely, they have value 1 if either or both of $x_i$ and $x_j$ are in $X$, value 0 if neither is in $X$. Technically this means that one thinks of a subset $X$ of $R$ as an assignment of values 1 or 0 to the $x_i$'s according to whether the $i$-th point is in $X$, so “$x_i$” is used ambiguously both to denote the $i$-th point and to denote the set-function $\lceil x_i \in X \rceil$. We can exploit this by writing predicates in arithmetic instead of logical forms, that is, $\lceil x_1 + x_2 + x_3 > 0 \rceil$ instead of $x_1 \lor x_2 \lor x_3$, or even $\lceil 2x_1x_2 - x_1 - x_2 > -1 \rceil$ instead of $x_1 \equiv x_2$, where $x_1 \equiv x_2$ is the predicate that is true if both, or neither, but not just one of $x_1$ and $x_2$ are in $X$.

We will need to be able to express the idea that a function may “depend” only on a certain subset of $R$. We denote by $S(\varphi)$ the subset of $R$ upon which $\varphi$ “really depends”: technically, $S(\varphi)$ is the smallest subset $S$ of $R$ such that, for every subset $X$ of $R$,

$$\varphi(X) = \varphi(X \cap S),$$

where “$X \cap S$” is the intersection of $X$ and $S$, that is, the set of points that are in both. We call $S(\varphi)$ the support of $\varphi$. For an infinite space $R$, some interesting predicates will have $S(\varphi)$ undefined. Consider for example $\varphi(X) = \lceil X \text{ contains an infinite set of points} \rceil$. One could determine whether $\varphi(X)$ is true by examining the points of $X$ that lie in any set $S$ that contains all but a finite number of points of $R$. And there is no “smallest” such set!

1.2 Functions Linear with respect to a Class of Predicates

Let $\Phi$ be a family of predicates defined on $R$. We say that $\psi$ is a linear threshold function with respect to $\Phi$ if there exists a number $\theta$ and a set of numbers $\alpha(\varphi)$, one for each $\varphi$ in $\Phi$, such that

$$\psi(X) = \Big\lceil \sum_{\varphi \in \Phi} \alpha(\varphi) \cdot \varphi(X) > \theta \Big\rceil.$$

That is, $\psi(X)$ is true exactly when the inequality within the $\lceil\ \rceil$'s is true. We often write this less formally as $\psi = \lceil \sum \alpha(\varphi)\varphi > \theta \rceil$, or even as $\psi = \lceil \sum \alpha\varphi > \theta \rceil$.
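The arithmetic forms and the notion of support introduced above can be checked mechanically on a small retina. The following Python sketch is illustrative only: the helper names `subsets`, `point`, `linear_threshold`, and `support` are hypothetical and not notation from the text, figures are modelled as frozensets of integer-labelled points, and the support is computed (for finite $R$ only) as the set of points whose presence can change the value of the predicate.

```python
from itertools import combinations

def subsets(points):
    """All subsets of a finite point set, as frozensets."""
    pts = sorted(points)
    return [frozenset(c) for r in range(len(pts) + 1)
            for c in combinations(pts, r)]

def point(i):
    """One-point predicate: the set-function [x_i in X]."""
    return lambda X: int(i in X)

def linear_threshold(phis, alphas, theta):
    """psi(X) = [ sum of alpha(phi) * phi(X) > theta ], with values 1 or 0."""
    def psi(X):
        return int(sum(a * phi(X) for a, phi in zip(alphas, phis)) > theta)
    return psi

def support(phi, R):
    """Support S(phi) on a finite R: the points whose presence can change phi."""
    return frozenset(p for p in R
                     if any(phi(X) != phi(X ^ {p}) for X in subsets(R)))

R = {1, 2, 3}

# [x1 + x2 + x3 > 0] in place of x1 OR x2 OR x3
disjunction = linear_threshold([point(1), point(2), point(3)], [1, 1, 1], 0)

# [2*x1*x2 - x1 - x2 > -1] in place of x1 == x2
x1x2 = lambda X: int(1 in X and 2 in X)
equivalence = linear_threshold([x1x2, point(1), point(2)], [2, -1, -1], -1)

for X in subsets(R):
    assert disjunction(X) == int(len(X) > 0)
    assert equivalence(X) == int((1 in X) == (2 in X))

print(sorted(support(equivalence, R)))  # prints [1, 2]
```

The equivalence predicate is defined on all of $R = \{1, 2, 3\}$ but depends only on the first two points, which is what the computed support reports.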
For symmetry, we want to include with $\psi$ its negation

$$\bar{\psi}(X) = \Big\lceil \sum \alpha(\varphi)\varphi \le \theta \Big\rceil$$

in the class of linear threshold functions. For a given $\Phi$, we use $L(\Phi)$ to denote the set of all predicates that can be defined in this way, that is, by choosing different numbers for $\theta$ and for the $\alpha$'s.

For a two-point space $R = \{x, y\}$, the class $L(\{x, y\})$ of functions linear in the two one-point predicates includes 14 of the $16 = 2^{2^2}$ possible Boolean functions. For larger numbers of points, the fraction of functions linear in the one-point predicates decreases very rapidly toward zero.

1.2.1 Other possible definitions of $L(\Phi)$

Because the definition of $L(\Phi)$ is so central to what follows, it is worth examining it to see which features are essential and which are arbitrary. The following proposition will mention a number of ways the definition could be changed without significantly altering its character. In fact, for finite $R$, the most important case, all the proposed alternatives lead to strictly equivalent definitions. In the case of infinite $R$-spaces, some of them lead to different meanings for $L(\Phi)$ but never in a way that will affect any of our subsequent discussions.

Proposition: The following modifications in the formal definition of $L(\Phi)$ result in defining the same classes of predicates.

(1) If $\Phi$ is assumed to contain the constant function, $I(X) \equiv 1$, then $\theta$ can be taken to be zero.
(2) The inequality sign “$>$” can be replaced by “$<$,” “$\ge$,” or “$\le$.”
(3) If $R$ is finite then all the $\alpha(\varphi)$'s, and $\theta$, can be confined to integer values.
(4) All the alternatives in 1-3 can be chosen independently.

These assertions are all obviously true; the following proofs are intended mainly to help readers who would like some practice in using our notations.

PROOF of (1): Define $\alpha'(I) = \alpha(I) - \theta$ and otherwise $\alpha'(\varphi) = \alpha(\varphi)$. Then $\lceil \sum \alpha(\varphi)\varphi(X) > \theta \rceil = \lceil \sum \alpha'(\varphi)\varphi(X) > 0 \rceil$.

PROOF of (2): Let $\alpha'(\varphi) = -\alpha(\varphi)$ and $\theta' = -\theta$.
Then $\lceil \sum \alpha(\varphi)\varphi < \theta \rceil = \lceil \sum \alpha'(\varphi)\varphi > \theta' \rceil$. The other replacements follow by exchanging all predicates and their negations.

PROOF of (3): If $R$ is finite then $\Phi$ is finite and we can assume that there is no $X$ for which $\sum \alpha(\varphi)\varphi(X) = \theta$. For, if there is such an $X$ we can remedy this by changing $\theta$ to $\theta + \tfrac{1}{2}\delta$, where $\delta$ is less than the smallest nonzero value of $|\sum \alpha(\varphi)\varphi(X) - \theta|$. Suppose first that all the $\alpha(\varphi)$'s are rational. Let $D$ be the product of all their denominators and define $\alpha'(\varphi) = D\alpha(\varphi)$ and $\theta' = D\theta$. Then the $\alpha'(\varphi)$'s are all integers and clearly $\lceil \sum \alpha(\varphi)\varphi(X) > \theta \rceil = \lceil \sum \alpha'(\varphi)\varphi(X) > \theta' \rceil$ for all $X$. Now suppose that some members of $\{\alpha(\varphi)\}$ are irrational. Then replace each $\alpha(\varphi)$ by some rational number in the interval

$$\alpha(\varphi) < \alpha'(\varphi) < \alpha(\varphi) + \frac{\delta}{2^{2^{|R|}}},$$

where $\delta$ is as defined above. This replacement cannot change the sum $\sum \alpha(\varphi)\varphi(X)$ by as much as $\delta$, so it can never affect the value of $\lceil \sum \alpha(\varphi)\varphi(X) > \theta \rceil$. For there are at most $2^{2^{|R|}}$ different $\varphi$'s.

1.3 The Concept of Order

Predicates whose supports are very small are too local to be interesting in themselves. Our main interest is in predicates whose support is the whole of $R$, but which can be represented as linear threshold combinations of predicates with small support. A simple example is

$$\psi(X) = \lceil X \text{ is not empty} \rceil.$$

Clearly $S(\psi) = R$. On the other hand if we let $\Phi$ be the set of predicates of the form $\varphi_p(X) = \lceil p \in X \rceil$ we have

$$|S(\varphi)| = 1 \text{ for all } \varphi \text{ in } \Phi, \qquad \psi(X) = \Big\lceil \sum_{p \in R} \varphi_p(X) > 0 \Big\rceil.$$

These two statements allow us to say that the order of $\psi$ is 1. In general, the order of $\psi$ is the smallest number $k$ for which we can find a set $\Phi$ of predicates satisfying

$$|S(\varphi)| \le k \text{ for all } \varphi \text{ in } \Phi, \qquad \psi \in L(\Phi).$$

It should be carefully observed that the order of $\psi$ is a property of $\psi$ alone, and not relative to any particular $\Phi$. This is what makes it an important “absolute” concept.
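The count given in §1.2, that 14 of the 16 Boolean functions on a two-point space are linear in the two one-point predicates, can be explored by brute force. The sketch below is only a sanity check under stated assumptions: it searches integer values of $\alpha$, $\beta$, and $\theta$ in a small range (the bound of 2 is an arbitrary choice that happens to suffice here), and it counts both $\lceil \alpha x + \beta y > \theta \rceil$ and its negation. The names `truth_table` and `linear_tables` are hypothetical. A bounded search can only exhibit representable functions; that the two missing functions have no representation at all is what the text proves in §2.1.

```python
from itertools import product

def truth_table(alpha, beta, theta):
    """Values of [alpha*x + beta*y > theta] for (x, y) = (0,0), (0,1), (1,0), (1,1)."""
    return tuple(int(alpha * x + beta * y > theta)
                 for x, y in product((0, 1), repeat=2))

def linear_tables(bound=2):
    """Truth tables reachable with integer alpha, beta, theta in [-bound, bound]."""
    tables = set()
    values = range(-bound, bound + 1)
    for alpha, beta, theta in product(values, repeat=3):
        table = truth_table(alpha, beta, theta)
        tables.add(table)
        tables.add(tuple(1 - v for v in table))  # the negation is also included
    return tables

TABLES = linear_tables()
EXCLUSIVE_OR = (0, 1, 1, 0)
EQUIVALENCE = (1, 0, 0, 1)

print(len(TABLES))  # prints 14
```

The two truth tables that never appear in the search are exactly exclusive-or and equivalence.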
Those who know the appropriate literature will recognize “order 1” as what are usually called “linear threshold functions.”

1.4 Masks and other Examples of Linear Representation

A very special role will be played by predicates of the form

$$\varphi_A(X) = \lceil \text{all members of } A \text{ are in } X \rceil = \lceil A \subset X \rceil.$$

In the conventional Boolean point-function notations, these predicates appear as conjunctions: if $A = \{y_1, \ldots, y_n\}$ then $\varphi_A(X) = y_1 \land y_2 \land \ldots \land y_n$ or, as it is usually written, $\varphi_A(X) = y_1 y_2 \ldots y_n$. We shall call $\varphi_A$ the mask of the set $A$. In particular the constant predicate $I(X)$ is the mask of the empty set; and the predicates $\varphi_p$ in the previous paragraph are the masks of the one-point sets.

Proposition: All masks are of order 1.

PROOF: Let $A$ be any finite set. It contains $|A|$ points. For each point $x \in A$ define $\varphi_x(X)$ to be $\lceil x \in X \rceil$. Then

$$\varphi_A(X) = \Big\lceil \sum_{x \in A} \varphi_x(X) \ge |A| \Big\rceil.$$

Example 1: Of the 16 Boolean functions of two variables, all have order 1 except for exclusive-or, $x \oplus y$, and its complement identity, $x \equiv y$, which are of order 2:

$$x \oplus y = \lceil x\bar{y} + \bar{x}y > 0 \rceil, \qquad x \equiv y = \lceil xy + \bar{x}\bar{y} > 0 \rceil,$$

where, for example, “$x\bar{y}$” is the predicate of support $= 2$ which is true only when $x$ is in $X$ and $y$ is not in $X$. (Problem: prove that $x \oplus y$ is not order 1!) Other examples from Boolean algebra are

$$x \supset y = \lceil \bar{x} \lor y \rceil = \lceil y - x > -1 \rceil, \qquad \sim x = \lceil -x > -1 \rceil.$$

[Figure (stereoscopic): One can think of a linear inequality as defining a surface that separates points into two classes, thus defining a predicate. We do not recommend this kind of imagery until Part III.]

Any mask has order 1:

$$x \land y \land z = \lceil x + y + z > 2 \rceil,$$

as does any disjunction

$$x \lor y \lor z = \lceil x + y + z > 0 \rceil.$$

Example 2: $x_1 \equiv x_2$ can be represented as a linear combination of masks by

$$x_1x_2 \lor \bar{x}_1\bar{x}_2 = \lceil x_1x_2 + (1 - x_1)(1 - x_2) > 0 \rceil = \lceil 2x_1x_2 - x_1 - x_2 > -1 \rceil.$$

A proof that “exclusive-or” and equivalence are not of order 1 will be found in §2.1.

Example 3: Let $M$ be an integer, $0 < M < |R|$.
Then the “counting predicate” $\psi^M$, or $\lceil |X| = M \rceil$, which recognizes that $X$ contains exactly $M$ points, is of order 2.

PROOF: Consider the representation

$$\Big\lceil (2M - 1) \sum_{\text{all } i} x_i + (-2) \sum_{i < j} x_i x_j \ge M^2 \Big\rceil.$$

For any figure $X$ there will be $|X|$ terms $x_i$ with value 1, and $\tfrac{1}{2}|X| \cdot (|X| - 1)$ terms $x_i x_j$ with value 1. Then the predicate is

$$\lceil (2M - 1) \cdot |X| - |X| \cdot (|X| - 1) - M^2 \ge 0 \rceil = \lceil (|X| - M)^2 \le 0 \rceil$$

and the only value of $|X|$ for which this is true is $|X| = M$. Observe that, by increasing the threshold we can modify the predicate to accept counts within an arbitrary interval instead of a single value.

We have shown that the order is not greater than 2; Theorem 2.4 will confirm that it is not order 1. Note that the linear form for the counting predicate does not contain $|R|$ explicitly. Hence it works as well for an infinite space $R$.

Example 4: The predicates $\lceil |X| \ge M \rceil$ and $\lceil |X| \le M \rceil$ are of order 1 because they are represented by $\lceil \sum x_i \ge M \rceil$ and $\lceil \sum x_i \le M \rceil$.

1.5 The “Positive Normal Form Theorem”

The order of a function can be determined by examining its representation as a linear threshold function with respect to the set of masks (Theorem 1.5.3). To prove this we first show

Theorem 1.5.1: Every $\psi$ is a linear threshold function with respect to the set of all masks, that is, $\psi \in L(\text{all masks})$.

PROOF: Any Boolean function $\psi(x_1, \ldots, x_n)$ can be written in the disjunctive normal form

$$C_1(X) \lor C_2(X) \lor \ldots \lor C_p(X),$$

where each $C_j(X)$ has the form of a product (that is, a conjunction) $z_1 z_2 \ldots z_n$ in which each $z_j$ is either an $x_j$ or an $\bar{x}_j$. Since at most one of the $C_j(X)$ can be true for any $X$, we can rewrite $\psi$, using the arithmetic sum

$$C_1(X) + C_2(X) + \ldots + C_p(X).$$

Next, the following formula can be applied to any product containing a barred letter: let $\mathcal{S}$ and $\mathcal{T}$ be any strings of letters.
$$\mathcal{S}\,\bar{x}_j\,\mathcal{T} = \mathcal{S}\,(1 - x_j)\,\mathcal{T} = \mathcal{S}\mathcal{T} - \mathcal{S}\,x_j\,\mathcal{T},$$

where $\mathcal{S}$ and $\mathcal{T}$ denote the strings of letters on either side of $\bar{x}_j$. If we continue to apply this, all the bars can be removed, without ever increasing the length of a product. When all the bars have been eliminated and like terms have been collected together we have

$$\psi(X) = \sum \alpha_i \varphi_i(X), \qquad \text{(POSITIVE NORMAL FORM)}$$

where each $\varphi_i$ is a mask, and each $\alpha_i$ is an integer. Since $\sum \alpha_i \varphi_i(X)$ is 0 or 1, this is the same as

$$\psi(X) = \Big\lceil \sum \alpha_i \varphi_i(X) > 0 \Big\rceil.$$

Example: $\lceil x_1 + x_2 + x_3 \text{ is odd} \rceil = x_1 + x_2 + x_3 - 2x_1x_2 - 2x_2x_3 - 2x_3x_1 + 4x_1x_2x_3.$

Theorem 1.5.2: The positive normal form is unique. (Optional)

PROOF: To see this let $\{\varphi_i\}$ be a set of masks and $\{\gamma_i\}$ a set of numbers, none equal to zero. Choose a $k$ for which $S(\varphi_k)$ is minimal, that is, there is no $j \ne k$ such that $S(\varphi_j) \subset S(\varphi_k)$. Then

$$\varphi_k(S(\varphi_k)) = 1, \qquad \varphi_j(S(\varphi_k)) = 0 \quad (j \ne k).$$

It follows that $\sum \gamma_i \varphi_i(X)$ is not identically zero since it has the value $\gamma_k$ for $X = S(\varphi_k)$. Now if $\sum \alpha_i \varphi_i(X) = \sum \beta_i \varphi_i(X)$ for all $X$, then $\sum (\alpha_i - \beta_i)\varphi_i(X) = 0$ for all $X$. Applying the preceding observation to the terms for which $\alpha_i - \beta_i \ne 0$, it follows that all $\alpha_i = \beta_i$. This proves the uniqueness of the coefficients of the positive normal form of $\psi$.

Note that the positive normal form always has the values 0 and 1 as ordinary arithmetic sums; i.e., without requiring the $\lceil\ \rceil$ device of interpreting the validity of an inequality as a predicate.

Theorem 1.5.3: $\psi$ is of order $k$ if and only if $k$ is the smallest number for which there exists a set $\Phi$ of masks satisfying

$$|S(\varphi)| \le k \text{ for all } \varphi \text{ in } \Phi, \qquad \psi \in L(\Phi).$$

PROOF: In $\psi = \lceil \sum \alpha_i \varphi_i > 0 \rceil$, each $\varphi_i$ can be replaced by its positive normal form. If $|S(\varphi_i)| \le k$, this will be true also of all the masks that appear in the positive normal form.

Example: A “Boolean form” has order no higher than the degree in its disjunctive normal form. Thus

$$\sum \alpha_{ijk}\, x_i \bar{x}_j x_k = \sum \alpha_{ijk}\, x_i x_k - \sum \alpha_{ijk}\, x_i x_j x_k,$$

illustrating how the negations are removed without raising the order.
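For a small retina the positive normal form can be computed directly and compared with the parity example above. The sketch below does not follow the bar-elimination procedure of the proof; it substitutes inclusion-exclusion (Möbius inversion over subsets), $\alpha_A = \sum_{B \subseteq A} (-1)^{|A| - |B|}\,\psi(B)$, which must give the same coefficients because the positive normal form is unique (Theorem 1.5.2). The function names and the representation of masks as frozensets are assumptions made for illustration.

```python
from itertools import combinations

def subsets(points):
    """All subsets of a finite point set, as frozensets."""
    pts = sorted(points)
    return [frozenset(c) for r in range(len(pts) + 1)
            for c in combinations(pts, r)]

def positive_normal_form(psi, R):
    """Map each mask A (a frozenset) to its nonzero integer coefficient alpha_A."""
    coeffs = {}
    for A in subsets(R):
        alpha = sum((-1) ** (len(A) - len(B)) * psi(B) for B in subsets(A))
        if alpha != 0:
            coeffs[A] = alpha
    return coeffs

def evaluate(coeffs, X):
    """Arithmetic sum of alpha_A * phi_A(X), where phi_A(X) = 1 iff A is covered by X."""
    return sum(alpha for A, alpha in coeffs.items() if A <= X)

R = {1, 2, 3}
parity = lambda X: len(X) % 2          # [ |X| is odd ]
pnf = positive_normal_form(parity, R)

# Largest mask in the form: an upper bound on the order of the predicate.
degree = max(len(A) for A in pnf)

for A in subsets(R):
    if A in pnf:
        print(sorted(A), pnf[A])
```

Here the one-point masks receive coefficient 1, the two-point masks $-2$, and the three-point mask 4, in agreement with the expansion of $\lceil x_1 + x_2 + x_3 \text{ is odd} \rceil$ given in the text.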
This particular order-3 form, $\sum \alpha_{ijk}\, x_i \bar{x}_j x_k$, appears later (§6.3) in a perceptron that recognizes convex figures.

It is natural to wonder about the orders of predicates that are Boolean functions of other predicates. Theorem 1.5.4 gives an encouraging result:

Theorem 1.5.4: If $\psi_1$ has order $O_1$ and $\psi_2$ has order $O_2$, then $\psi_1 \oplus \psi_2$ and $\psi_1 \equiv \psi_2$ have order $\le O_1 + O_2$.

PROOF: Let $\psi_1 = \lceil \sum \alpha_i \varphi_i > 0 \rceil$ and $\psi_2 = \lceil \sum \alpha'_j \varphi'_j > 0 \rceil$ and assume that the coefficients are chosen so that the inner sums never exactly equal zero. Then

$$\lceil \psi_1 \equiv \psi_2 \rceil = \Big\lceil \Big(\sum \alpha_i \varphi_i\Big)\Big(\sum \alpha'_j \varphi'_j\Big) > 0 \Big\rceil = \Big\lceil \sum (\alpha_i \alpha'_j)(\varphi_i \varphi'_j) > 0 \Big\rceil.$$

But $|S(\varphi_i \varphi'_j)| \le |S(\varphi_i)| + |S(\varphi'_j)|$. The other conclusion follows from $\lceil \psi_1 \oplus \psi_2 \rceil = 1 - \lceil \psi_1 \equiv \psi_2 \rceil$.

Example: Since

$$\psi^M(X) = \lceil\, \lceil M \ge |X| \rceil \equiv \lceil |X| \ge M \rceil \,\rceil$$

we conclude that $\psi^M$ has order $\le 2$, which is another way to obtain the result of §1.4, Example 3.

Question: What can be said about the orders of $\lceil \psi_1 \land \psi_2 \rceil$ and $\lceil \psi_1 \lor \psi_2 \rceil$? The answer to this question may be surprising, in view of the simple result of Theorem 1.5.4: It is shown in Chapter 4 that for any order $n$, there exists a pair of predicates $\psi_1$ and $\psi_2$, both of order 1, for which $(\psi_1 \land \psi_2)$ and $(\psi_1 \lor \psi_2)$ have order $> n$. In fact, suppose that $R = A \cup B \cup C$ where $A$, $B$, and $C$ are large disjoint subsets of $R$. Then $\psi_1 = \lceil |X \cap A| > |X \cap C| \rceil$ and $\psi_2 = \lceil |X \cap B| > |X \cap C| \rceil$ each have order 1 because they are represented by

$$\Big\lceil \sum_{x_i \in A} x_i > \sum_{x_j \in C} x_j \Big\rceil \qquad \text{and} \qquad \Big\lceil \sum_{x_i \in B} x_i > \sum_{x_j \in C} x_j \Big\rceil,$$

but we shall see in Chapter 4 that $(\psi_1 \land \psi_2)$ and $(\psi_1 \lor \psi_2)$ are not even of finite order in the sense about to be described in §1.6.

1.6 Predicates of Finite Order

Strictly, a predicate is defined for a particular set $R$ and it makes no formal sense to speak of the same predicate for different $R$'s.
But, as noted in §1.0, our real motivation is to learn more about “predicates” defined independently of $R$: for example, concerning the number of elements in the set $X$, or other geometric properties of figures in a real Euclidean plane to which $X$ and $R$ provide mere approximations. To be more precise we could use a phrase such as predicate scheme to refer to a general construction which defines a predicate for each of a large class of sets $R$. This would be too pedantic so (except in this section) we shall use “predicate” in this wider sense.

Suppose we are given a predicate scheme $\psi$ which defines a predicate $\psi_R$ for each of a family $\{R\}$ of retinas. We shall say that $\psi$ is of finite order, in fact of order $\le k$, if the orders of the $\psi_R$ are uniformly bounded by $k$ for all $R$'s in the family. Two examples will make this clearer:

1. Let $\{R_i\}$ be a sequence of sets with $|R_i| = i$. For each $R_i$ there is a predicate $\psi_i$ defined by the predicate scheme $\psi_{\text{PARITY}}(X)$ which asserts, for $X \subset R_i$, that “$|X|$ is an odd number.” As we will see in §3.1, the order of any such $\psi_i$ must be $i$. Thus $\psi_{\text{PARITY}}$ is not of finite order.

2. Now let $\psi_i$ be the predicate defined over $R_i$ by the predicate scheme $\psi_{\text{TEN}}(X) = \lceil |X| = 10 \rceil$. We have shown in §1.4 that $\psi_i$ is of order 2 for all $R_i$ with $i > 10$, and it is (trivially) of order zero for $R_1, \ldots, R_9$. Thus the predicate scheme $\psi_{\text{TEN}}$ is of finite order; in fact, it has order 2.

In these cases one could obtain the same dichotomy by considering infinite sets $R$. On an infinite retina the predicate $\psi_{\text{TEN}}(X) = \lceil |X| = 10 \rceil$ is of finite order, in fact of order $= 2$, while $\psi_{\text{PARITY}}(X) = \lceil |X| \text{ is odd} \rceil$ has no order. We shall often look at problems in this way, for it is often easier to think about one machine, even of infinite size, than about an infinite family of finite machines. In Chapter 7 we will discuss formalization of the concept of an infinite perceptron.
It should be noted, however, that the use of infinite perceptrons does not cover all cases. For example, the predicate

$$\psi(X) = \lceil |X| > \tfrac{1}{2}|R| \rceil$$

is well-defined and of order 1 for any finite $R$. It is meaningless for infinite $R$, yet we might like to consider the corresponding predicate scheme to have finite order.

2 Group Invariance of Boolean Inequalities

2.0 In this chapter we consider linear threshold inequalities that are invariant under groups of transformations of the points of the base-space $R$. The purpose of this, finally achieved in Part II, is to establish a connection between the geometry of $R$ and the realizability of geometric predicates by perceptrons of finite order.

2.1 Example: Coefficients Averaged Over a Symmetry

As an introduction to the methods introduced in this section we first consider a simple, almost trivial, example. Our space $R$ has just two points, $x$ and $y$. We will prove that the predicate $\psi_=$ is not of order 1. One way to prove this is to deduce a contradiction from the hypothesis that numbers $\alpha$, $\beta$, and $\theta$ can be found for which

$$\psi_=(x, y) = xy \lor \bar{x}\bar{y} = \lceil \alpha x + \beta y > \theta \rceil.$$

[Authors' handwritten note: This $\psi_=$ asserts that $X$ is all black or all white.]

We can proceed directly by writing down the conditions on $\alpha$ and $\beta$:

$$\begin{aligned}
\psi_=(1, 0) = 0 &\;\Rightarrow\; \alpha \le \theta \\
\psi_=(0, 1) = 0 &\;\Rightarrow\; \beta \le \theta \\
\psi_=(1, 1) = 1 &\;\Rightarrow\; \alpha + \beta > \theta \\
\psi_=(0, 0) = 1 &\;\Rightarrow\; 0 > \theta
\end{aligned}$$

In this simple case it is easy enough to deduce the contradiction, for adding the first two conditions gives us $\alpha + \beta \le 2\theta$, and this, with the third, implies that $\theta < 2\theta$, and this would make $\theta$ positive, contradicting the fourth condition. But arguments of this sort are hard to generalize to more complex situations involving many variables.
On the other hand the following argument, though it may be considered a little more complicated in itself, leads directly to much deeper insights. First we observe that the value of $\psi_=$ is unchanged when we permute, that is, interchange, $x$ and $y$. That is,

$$\psi_=(x, y) = \psi_=(y, x).$$

Thus if one of the following holds, so must the other:

$$\alpha x + \beta y > \theta, \qquad \alpha y + \beta x > \theta;$$

hence

$$\tfrac{1}{2}(\alpha + \beta)\,x + \tfrac{1}{2}(\alpha + \beta)\,y > \theta$$

by adding the inequalities. Similarly, either of

$$\alpha x + \beta y \le \theta, \qquad \alpha y + \beta x \le \theta$$

yields

$$\tfrac{1}{2}(\alpha + \beta)\,x + \tfrac{1}{2}(\alpha + \beta)\,y \le \theta.$$

It follows that if we write $\gamma$ for $\tfrac{1}{2}(\alpha + \beta)$, then

$$\psi_=(x, y) = \lceil \gamma x + \gamma y > \theta \rceil = \lceil \gamma (x + y) > \theta \rceil.$$

Thus we can construct a new linear representation for $\psi_=$ in which the coefficients of $x$ and $y$ are equal. It follows that

$$\psi_=(X) = \lceil \gamma |X| > \theta \rceil,$$

where $|X|$ is, as usual, the number of points in $X$. Now consider the three figures $X_0 = \{\}$, $X_1 = \{x\}$, $X_2 = \{x, y\}$:

$$|X_0| = 0 \text{ and } \gamma \cdot 0 > \theta, \qquad |X_1| = 1 \text{ and } \gamma \cdot 1 \le \theta, \qquad |X_2| = 2 \text{ and } \gamma \cdot 2 > \theta.$$

This is clearly impossible. Thus we learn something about $\psi$ by “averaging” its coefficients after making a permutation that leaves the predicate unchanged. (In the example $\gamma$ is the average of $\alpha$ and $\beta$.) In §2.3 we shall define precisely the notion of “average” that is involved here.

[Figure; surviving caption fragment: “... the shaded regions, but this requires a polynomial of second or higher degree.”]

2.1.1* Groups and Equivalence-classes of Predicates

The generalization of the procedure of §2.1 involves introducing an arbitrary group of transformations on the base space $R$, and then asking what it means for a predicate $\psi$ to be unaffected by any of these transformations (just as the predicate of §2.1 was unaffected by transposing two points).
It is through this idea of “invariance under a group of transformations” that we will be able to attack geometrical problems; in so doing we are adopting the mathematical viewpoint of Felix Klein: every interesting geometrical property is an invariant of some transformation group.

A good example of a group of transformations is the set of all translations of the plane: a translation is a transformation that moves every point of the plane into another point, with every point moved the same amount in the same direction; that is, a rigid parallel shift. Figure 2.2 illustrates the effect of two translations, $g_1$ and $g_2$, on a figure $X$. The picture illustrates a number of definitions and observations we want to make.

*This section can be skipped by readers who know the basic definitions of the theory of groups.

1. We define a translation to operate upon individual points, so that $g_1$ operating on the point $x$ yields another point $g_1 x$. This “induces” a natural sense in which the $g$'s operate on whole figures; let us define it. If $g$ is one of a group $G$ of transformations (abbreviated “$g \in G$”) and if $X$ is a figure, that is, a subset of $R$, we define

$$gX = \{\, gx \mid x \in X \,\},$$

which is read: $gX$ is (defined to be) the set of all points $gx$ obtained by applying $g$ to points $x$ of $X$.

2. If we apply to $X$ first one transformation $g_1$ and then another transformation $g_2$ we obtain a new figure that could be denoted by “$g_2(g_1 X)$.” But that same figure could be regarded as obtained by a single transformation, the “composition” of $g_2$ and $g_1$, and it is customary to denote this composite by “$g_2 g_1$” and hence the new figure by “$g_2 g_1 X$” as shown in the figure. The mathematical definition of group requires that if $g_1 \in G$ and $g_2 \in G$ then their composition $g_2 g_1$ must also be in $G$.
Incidentally, in the case of the plane-translations, it is always true that $g_1 g_2 X = g_2 g_1 X$, as can be seen by completing the parallelogram of $X$, $g_1 X$, $g_2 X$, and $g_2 g_1 X$. This is to be regarded as a coincidence, because it is not always true in other important geometric groups. For example, if $G$ is the group generated by all rotations about all points in the plane, then for the indicated $g_1$ and $g_2$ shown below, the points $g_1 g_2 x$ and $g_2 g_1 x$ are quite different.

Figure 2.3. Here $g_1$ is a small rotation about $p_1$, and $g_2$ is a 90° rotation about $p_2$. The figure shows why, for the rotation group, we usually find that $g_1 g_2 x \ne g_2 g_1 x$.

3. The final requirement of the technical definition of “group of transformations” is that for each $g \in G$ there must be an inverse transformation called $g^{-1}$, also in $G$, with the property that $g^{-1} g x = x$ for every point $x$. In Figure 2.2 we have indicated the inverses of the translations $g_1$ and $g_2$. One can construct the inverse of $g_2 g_1$ by completing the parallelogram to the left. In fact a little thought will show that (in any group!) it is always true that $(g_2 g_1)^{-1} = g_1^{-1} g_2^{-1}$.

It is always understood that a group contains the trivial “identity” transformation $e$, which has the property that for all $x$, $ex = x$. In fact, since $e$ is the composition $g^{-1} g$ of any $g$ and its inverse $g^{-1}$, the presence of $e$ in $G$ already logically follows from the requirements of 2 and 3. It is easy to see also that $g g^{-1} = e$ always. In algebra books, one finds additional requirements on groups, for example, that $(g_1 g_2) g_3 = g_1 (g_2 g_3)$ for all $g_1$, $g_2$, and $g_3$. For the groups of transformations we always use here, this goes without saying, because it is already implicit in the intuitive notion of transformation. The associative law above is seen to hold simply by following what happens to each separate point of $R$.

4.
If $h$ is a particular member of $G$ then the set $hG$ defined by

$$hG = \{\, hg \mid g \in G \,\}$$

(that is, the set obtained by composing $h$ with every member of $G$) is the whole group $G$ and each element is obtained just once. To see this, note first that any element $g$ is obtained:

$$h(h^{-1} g) = (h h^{-1}) g = eg = g$$

and $h^{-1} g$ must be in the group. If, say, $g_0$ happens twice,

$$g_0 = h g_1, \qquad g_0 = h g_2,$$

we would have both of

$$h^{-1} g_0 = h^{-1} h g_1 = g_1, \qquad h^{-1} g_0 = h^{-1} h g_2 = g_2,$$

so that $g_1$ and $g_2$ could not be different.

5. In most of what follows, and particularly in §2.3, we want to work with groups $G$ that contain only a finite number of transformations. But still, we want to capture the spirit of the ordinary Euclidean transformation groups, which are infinite. There are an infinite number of different distances a figure can be translated in the plane: for example, if $g \ne e$ is any nontrivial translation then $g, gg, ggg, \ldots$ are all different. In most cases we will be able to prove the theorems we want by substituting a finite group for the infinite one, if necessary by altering the space $R$ itself! For example, in dealing with translation we will often use, instead of the Euclidean plane, a torus, as in Figure 2.4.

[Figure 2.4]

The torus is ruled off in squares, as shown. As our substitute for the infinite set of plane-translations, we consider just those torus-transformations that move each point a certain number $m$ of square units around the large equator, and a certain number $n$ of units around the small equator. There are just a finite number of such “translations.” The torus behaves very much like a small part of the plane for most practical purposes, because it can be “cut” and unwrapped (Figure 2.5). Hence for “small” figures and “small” translations there is no important difference between the torus and the plane. This will be discussed further in the introduction to Part II, and in Chapter 7.
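The group requirements listed in items 1 through 5 can be verified exhaustively for the finite group of torus translations. The Python sketch below is an illustration under assumptions: the torus size `M` by `N` is a hypothetical choice, a translation is represented by a pair `(m, n)` of shifts around the large and small equators, and the helper names are not taken from the text.

```python
from itertools import product

M, N = 4, 3  # hypothetical torus: M squares around the large equator, N around the small

# Each translation is a pair (m, n); the group G has M * N elements.
G = list(product(range(M), range(N)))
IDENTITY = (0, 0)

def apply(g, p):
    """Move the point p = (i, j) by the translation g, wrapping around the torus."""
    return ((p[0] + g[0]) % M, (p[1] + g[1]) % N)

def act(g, X):
    """gX = { gx | x in X }: the induced action on a whole figure."""
    return frozenset(apply(g, p) for p in X)

def compose(g2, g1):
    """The composite g2 g1: first apply g1, then g2."""
    return ((g1[0] + g2[0]) % M, (g1[1] + g2[1]) % N)

def inverse(g):
    """The translation that undoes g."""
    return ((-g[0]) % M, (-g[1]) % N)

X = frozenset({(0, 0), (1, 0), (1, 1)})  # a small figure

for g1, g2 in product(G, repeat=2):
    assert compose(g2, g1) in G                              # closure (item 2)
    assert act(compose(g2, g1), X) == act(g2, act(g1, X))    # g2 g1 X = g2 (g1 X)
    assert compose(g1, g2) == compose(g2, g1)                # translations commute

for g in G:
    assert compose(inverse(g), g) == IDENTITY                # inverses (item 3)

for h in G:
    assert sorted(compose(h, g) for g in G) == sorted(G)     # hG = G, each once (item 4)

print(len(G))  # prints 12
```

Working with pairs of residues rather than with functions makes composition and inversion simple modular arithmetic, which is why the checks can run over every pair of group elements.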
2.2 Equivalence-classes of Figures and of Predicates

Given a group $G$, we will say that two figures $X$ and $Y$ are $G$-equivalent (and we write $X \equiv_G Y$) if there is a member $g$ of $G$ for which $X = gY$. Notice that

$X \equiv_G X$, because $X = eX$;
$X \equiv_G Y$ implies $Y \equiv_G X$, because if $X = gY$ then $Y = g^{-1}X$;
$X \equiv_G Y$ and $Y \equiv_G Z$ imply $X \equiv_G Z$, because if $X = gY$ and $Y = hZ$ then $X = ghZ$.

When we choose a group, we thus automatically also set up a classification of figures into equivalence-classes. This is important later, because it will turn out that the “patterns” (or sets of figures) we want to recognize always fall into such classifications when we choose the right groups.

Example: Suppose that $G$ is the set of all permutations of a finite set of points $R$. (A permutation is any rearrangement of the points in which no two points are brought together.) Then (theorem!) two figures $X$ and $Y$ are $G$-equivalent if and only if they both contain the same number of points.

Example: If one wanted to build a machine to read printed letters or numbers, one would normally want it to be able to recognize them whatever their position on the page. That is to say that this machine's decision should be unaffected by members of the translation group. A more sophisticated way to say the same thing is to state that the machine's perception should be “translation-invariant,” that is, it must make the same decision on every member of a translation-equivalence class.*

*In practice, of course, one wants more from the machine: one wants to know not only what is on a page, but where it is. Otherwise, instead of “reading” what is on the page, the machine would present us with anagrams!

In §2.3 we prove an important theorem that tells us a great deal about any perceptron whose behavior is “$G$-invariant” for some group $G$, that is, one whose predicate $\psi(X)$ depends only upon the equivalence-class of $X$.
In order to state the theorem, we will have to define what we mean by $G$-equivalence of predicates. We will say that two predicates $\varphi$ and $\varphi'$ are equivalent with respect to a group $G$ (written $\varphi \equiv_G \varphi'$) if there is a member $g$ of $G$ such that $\varphi(gX)$ and $\varphi'(X)$ are the same for every $X$. It is easy to see that this really is an equivalence relation, that is,

$\varphi \equiv_G \varphi$ for any $\varphi$;
$\varphi \equiv_G \varphi'$ implies $\varphi' \equiv_G \varphi$;
$\varphi \equiv_G \varphi'$ and $\varphi' \equiv_G \varphi''$ imply $\varphi \equiv_G \varphi''$.

Given any predicate $\varphi$ and group element $g$, we will define $\varphi g$ to be the predicate that, for each $X$, has the value $\varphi(gX)$. Thus we always have $\varphi g(X) = \varphi(gX)$. We will say $\Phi$ is closed under $G$ if for every $\varphi$ in $\Phi$ and $g$ in $G$ the predicate $\varphi g$ is also in $\Phi$.

[Figure: Three $\varphi$'s Equivalent under a Rotation Group]

Now at last we can state and prove our main theorem. It will show that if a perceptron predicate is invariant under a group $G$, then its coefficients need depend only on the $G$-equivalence classes of their $\varphi$'s. This theorem will be our single most powerful tool in all that follows, for it is the generalization of our method of §2.1 and will let us convert complicated problems of geometry into (usually) simple problems of algebra.

2.3 The Group-Invariance Theorem

Suppose that
(1) $G$ is a finite group of transformations of a finite space $R$;
(2) $\Phi$ is a set of predicates on $R$ closed under $G$;
(3) $\psi$ is in $L(\Phi)$ and invariant under $G$.
Then there exists a linear representation of $\psi$,

$$\psi = \Big\lceil \sum_{\varphi \in \Phi} \beta_\varphi \varphi > 0 \Big\rceil,$$

for which the coefficients $\beta_\varphi$ depend only on the $G$-equivalence class of $\varphi$, that is, if $\varphi \equiv_G \varphi'$ then $\beta_\varphi = \beta_{\varphi'}$.

These conditions are stronger than they need be. To be sure, the theorem is not true in general for infinite groups. A counterexample will be found in §7.10. However, in special cases we can prove the theorem for infinite groups. An example with interesting consequences will be discussed later, in §10.4. It will also be seen that the assumption that $G$ be a group can be relaxed slightly.
We have not investigated relaxing condition (2), and this would be interesting. However, it does not interfere with our methods for showing certain predicates to be not of finite order. For when the theorem is applied to show that a particular $\psi$ is not in $L(\Phi)$ for a particular $\Phi$, it is done by showing that $\psi$ is not linear even in the $G$-closure of $\Phi$. Remember that the order of a predicate (§1.3) is defined without reference to any particular set $\Phi$ of $\varphi$'s! And closing a $\Phi$ under a group $G$ cannot change the maximum support size of the predicates in the set.

proof: Let $\psi(X)$ have a linear representation $\sum_{\varphi \in \Phi} \alpha(\varphi)\varphi(X) > 0$. [49] We use “$\alpha(\varphi)$” instead of $\alpha_\varphi$ to avoid complicated subscripts. Any element $g$ of $G$ defines a one-to-one correspondence $\varphi \leftrightarrow \varphi g$, that is, a permutation of the $\varphi$'s. Therefore

$$\sum_{\varphi \in \Phi} \alpha(\varphi)\varphi(X) = \sum_{\varphi \in \Phi} \alpha(\varphi g)\,\varphi g(X)$$

for all $X$, simply because the same numbers are added in both sums. Now, choose an $X$ for which $\psi(X) = 1$. Since $\psi$ is $G$-invariant, and $g^{-1}$ is an element of $G$, we must have

$$\sum_{\varphi \in \Phi} \alpha(\varphi g)\,\varphi g(g^{-1}X) > 0,$$

hence we conclude that for any $g$ in $G$, if $\psi(X) = 1$,

$$\sum_{\varphi \in \Phi} \alpha(\varphi g)\,\varphi(X) > 0.$$

Summing these positive quantities for all $g$'s in $G$, we see that

$$\sum_{g \in G} \sum_{\varphi \in \Phi} \alpha(\varphi g)\,\varphi(X) > 0.$$

If we collect together the coefficients for each $\varphi$, we then obtain

$$\sum_{\varphi \in \Phi} \Big( \sum_{g \in G} \alpha(\varphi g) \Big) \varphi(X) > 0,$$

which is an expression in $L(\Phi)$, that is, can be written as

$$\sum_{\varphi \in \Phi} \beta(\varphi)\varphi(X) > 0.$$

Remember that this depends on assuming that $\psi(X) = 1$. Now choose an $X$ for which $\psi(X) = 0$. Then the same argument will show that

$$\sum_{\varphi \in \Phi} \beta(\varphi)\varphi(X) \leq 0.$$

Combining the inequalities for $\psi = 1$ and $\psi = 0$, we conclude that

$$\psi(X) = \Big\lceil \sum_{\varphi \in \Phi} \beta(\varphi)\varphi(X) > 0 \Big\rceil.$$

[50] It remains only to show, as promised, that

$$\varphi \equiv_G \varphi' \;\Rightarrow\; \beta(\varphi) = \beta(\varphi').$$

But $\varphi \equiv_G \varphi'$ means that there is some $h$ such that $\varphi = \varphi' h$, so

$$\beta(\varphi) = \sum_{g \in G} \alpha(\varphi g) = \sum_{g \in G} \alpha(\varphi' h g) = \sum_{g \in G} \alpha(\varphi' g) = \beta(\varphi'),$$

because the one-to-one correspondence $g \leftrightarrow hg$ simply permutes the order of adding the same numbers.

second proof: Because of the importance of the theorem, we give another version of the proof, which may seem more intuitive to some readers. Choose an $X$ for which $\psi(X) = 1$. Then for any $g \in G$ we will have $\psi(gX) = 1$, hence each of the sums $\sum_{\varphi \in \Phi} \alpha(\varphi)\varphi(gX)$ will be positive, and so will their sum

$$\sum_{g \in G} \sum_{\varphi \in \Phi} \alpha(\varphi)\varphi(gX) = \sum_{g \in G} \sum_{\varphi \in \Phi} \alpha(\varphi)\,\varphi g(X).$$

[51] We can think of all the terms of this sum as arranged in a great $|\Phi| \times |G|$ array

$$\begin{array}{l}
\alpha(\varphi_1)\varphi_1 g_1 + \alpha(\varphi_2)\varphi_2 g_1 + \cdots + \alpha(\varphi_{|\Phi|})\varphi_{|\Phi|} g_1 \\
{}+ \alpha(\varphi_1)\varphi_1 g_2 + \alpha(\varphi_2)\varphi_2 g_2 + \cdots + \alpha(\varphi_{|\Phi|})\varphi_{|\Phi|} g_2 \\
\quad \vdots \\
{}+ \alpha(\varphi_1)\varphi_1 g_{|G|} + \alpha(\varphi_2)\varphi_2 g_{|G|} + \cdots + \alpha(\varphi_{|\Phi|})\varphi_{|\Phi|} g_{|G|},
\end{array}$$

each term being evaluated at $X$. We want to write this in the form $\beta_1\varphi_1 + \beta_2\varphi_2 + \cdots$, so we have to collect the coefficients of each $\varphi_i$. To do so, we have to collect together for each $i$ those terms $\alpha(\varphi_j)$ for which $\varphi_j g_k = \varphi_i$. The sum of those terms is, of course, $\beta_i$. Our real purpose, however, is not to calculate $\beta_i$ but to show that

$$\varphi_a \equiv_G \varphi_b \;\Rightarrow\; \beta_a = \beta_b.$$

To do this, suppose that in fact $\varphi_a \equiv_G \varphi_b$, which implies that we can find an element $g$ for which $\varphi_a = \varphi_b g$. We will use this to establish a one-to-one correspondence between the two sets of elements of the array that add up to form $\beta_a$ and $\beta_b$. Define “the $g_j$-entry of $\varphi_k$” to be $\alpha(\varphi_i)\varphi_i g_j$ where $i$ is determined by $\varphi_i g_j = \varphi_k$. Then in the array there is exactly one “$g_j$-entry of $\varphi_k$” for each $j$ and $k$. (It is irrelevant that there may be many different elements $h$ in $G$ that satisfy $\varphi_i h = \varphi_k$. We are concerned here only with each entry’s occurrence in the array, not with its value.)

Now, if $\alpha(\varphi_i)\varphi_i g_j$ is the $g_j$-entry of $\varphi_b$, then $\varphi_i g_j = \varphi_b$, [52] and therefore $\varphi_i g_j g = \varphi_b g = \varphi_a$; hence $\alpha(\varphi_i)\varphi_i g_j g$ is the $g_j g$-entry of $\varphi_a$. If we recall that $g_j \leftrightarrow g_j g$ is a one-to-one correspondence within the group elements, as shown in observation 4 of §2.1.1 (see Figure 2.6), we conclude that the corresponding elements in the $\beta_a$ and $\beta_b$ sums must have the same coefficients, so the sums $\beta_a$ and $\beta_b$ must be equal.

[Figure 2.6]

Since the same argument holds for the case of $\psi = 0$, the theorem is proved. Extensions of this to certain infinite spaces are discussed in Chapters 7 and 10.

For readers who find these ideas difficult to work with abstractly, some concrete examples of the equivalence classes will be useful; the geometric “spectra” of §5.2 and especially of §6.2 should be helpful.

We shall often use this theorem in the following form:

Corollary 1: Any group-invariant predicate $\psi$ (of order $k$) that satisfies the conditions of the theorem has a representation

$$\psi = \Big\lceil \sum_{\varphi \in \Phi} \alpha_\varphi \varphi > 0 \Big\rceil,$$

[53] where $\Phi$ is the set of masks (of degree $\leq k$) and $\alpha_\varphi = \alpha_{\varphi'}$ if $S(\varphi)$ can be transformed into $S(\varphi')$ by an element of $G$.

proof: For masks, $\varphi_A \equiv_G \varphi_B$ if and only if $A = gB$ for some $g \in G$.

Corollary 2: Let $\Phi = \Phi_1 \cup \cdots \cup \Phi_m$ be the decomposition of $\Phi$ into equivalence classes by the relation $\equiv_G$. Then, under the conditions of the theorem, $\psi$ can be written in the form

$$\psi = \Big\lceil \sum_i \alpha_i N_i(X) > 0 \Big\rceil,$$

where $N_i(X) = |\{\varphi : \varphi \in \Phi_i \text{ and } \varphi(X)\}|$, that is, the number of $\varphi$'s of the $i$th type, equivalent under the group, satisfied by the figure $X$.

proof: $\psi$ can be represented as

$$\Big\lceil \sum_{\varphi \in \Phi} \alpha_\varphi \varphi > 0 \Big\rceil, \text{ that is, } \Big\lceil \sum_i \sum_{\varphi \in \Phi_i} \alpha_\varphi \varphi > 0 \Big\rceil, \text{ that is, } \Big\lceil \sum_i \alpha_i \sum_{\varphi \in \Phi_i} \varphi > 0 \Big\rceil, \text{ that is, } \Big\lceil \sum_i \alpha_i N_i(X) > 0 \Big\rceil.$$

2.4 The Triviality of Invariant Predicates of Order 1: First application of the group-invariance theorem.
Theorem 2.4: Let $G$ be any group of permutations on $R$ that has the property:* for every pair of points $(p, q)$ of $R$ there is at least one $g$ in $G$ such that $gp = q$. Then the only first-order predicates invariant under $G$ are

$$\psi(X) = \lceil |X| > m \rceil, \quad \psi(X) = \lceil |X| \geq m \rceil, \quad \psi(X) = \lceil |X| < m \rceil, \quad \psi(X) = \lceil |X| \leq m \rceil,$$

for some $m$.

*This property, shared by most of the interesting geometric groups, is usually called “transitivity.” Pure rotations about a fixed center constitute an exception, as does the group of all translations parallel to a fixed direction in the plane. But the groups of all rotations about all centers, or all translations, etc., are transitive.

[54] proof: Since all the one-point predicates $\varphi_p$ are equivalent, we can assume that

$$\psi(X) = \Big\lceil \sum_{p \in X} \alpha\,\varphi_p > \theta \Big\rceil,$$

that is, the coefficients are independent of $p$. But $\sum \alpha\,\varphi_p > \theta$ is the same as $\sum \varphi_p > \theta/\alpha$ for $\alpha > 0$. (For negative $\alpha$ we have to use “$<$” instead.) And

$$\sum_{p \in X} \varphi_p = |X|.$$

Thus order-1 predicates invariant under the usual geometric groups can do nothing more than define simple “$> m$”-type inequalities on the size or “area” of the figures. In particular, taking the translation group $G$ we see that no first-order perceptron can distinguish the A’s in the figure on p. 46 from some other translation-invariant set of figures of the same area.

2.4.1 Noninvariant Predicates of Order 1

If one gives up geometric group invariance there are still some simple but useful predicates of order 1, for we can represent inequalities related to the ordinary integrals. For example, the following predicates of plane figures can be realized. Let $x_p$ and $y_p$ be the $x$ and $y$ coordinates of the point $p$:

[Figure 2.7]

[55] $\lceil X$ has more area in the right half-plane than in the left$\rceil = \Big\lceil \sum_{\text{right half}} \varphi_p - \sum_{\text{left half}} \varphi_p > 0 \Big\rceil$,

$\lceil$The center of gravity of $X$ is right of center$\rceil = \Big\lceil \sum x_p \varphi_p > 0 \Big\rceil$ (see Figure 0.3, p. 11),

$\lceil$The $n$th central moment of $X$ about the origin is greater than $\theta\rceil = \Big\lceil \sum \varphi_p \big(\sqrt{x_p^2 + y_p^2}\,\big)^n > \theta \Big\rceil$,

and so on. But these “moment-type” predicates are restricted to having their reference coordinates in the absolute plane, and not in the figure $X$. For example one cannot realize, with order 1,

$\lceil$The second moment of $X$ about its own center of gravity is greater than $\theta\rceil$

because that predicate is invariant under the (transitive) translation group.

MATHEMATICAL REMARK: There is a relation between these observations and Haar’s theorem on the uniqueness up to a constant factor of invariant measures. For finite sets and transitive groups the unique Haar measure is, in fact, the counting-function $\mu(X) = |X|$. The set function defined by

$$\mu(X) = \sum_i \alpha_i x_i = \sum_{x_i \in X} \alpha_i$$

satisfies $\mu(X) + \mu(Y) = \mu(X \cup Y) + \mu(X \cap Y)$. If we defined invariance by $\mu(X) = \mu(gX)$ it would follow immediately from Haar’s theorem that $\mu(X) = c|X|$, where $c$ is a constant. Our hypothesis on $\mu$ is slightly weaker since we merely assume

$$\mu(X) > \theta \iff \mu(gX) > \theta,$$

and deduce a correspondingly weaker conclusion, that is,

$$(\mu(X) > \theta) \iff (c|X| > \theta).$$

In the general case the relation between an invariance theorem and the theory of Haar measure is less clear since the set function $\sum \alpha_\varphi \varphi(X)$ is not in general a measure. This seems to suggest some generalization of measure but we have not tried to pursue this. Readers interested in the history of ideas might find it interesting to pursue the relation of these results to those of Pitts and McCulloch [1947].

3 Parity and One-in-a-box Predicates

3.0 In this chapter we study the orders of two particularly interesting predicates. Neither of them is “geometric,” because their invariance groups have too little structure. But in §5, we will apply them to geometric problems by picking out appropriate “subgroups” which have the right invariance properties.
3.1 The Parity Function

In this section we develop in some detail the analysis of the very simple predicate defined by

$$\psi_{\text{PARITY}}(X) = \lceil |X| \text{ is an odd number} \rceil.$$

Our interest in $\psi_{\text{PARITY}}$ is threefold: it is interesting in itself; it will be used for the analysis of other more important functions; and, especially, it illustrates our mathematical methods and the kind of questions they enable us to discuss.

Theorem 3.1.1: $\psi_{\text{PARITY}}$ is of order $|R|$. That is, to compute it requires at least one partial predicate whose support is the whole space $R$.

proof: Let $G$ be the group of all permutations of $R$. Clearly $\psi_{\text{PARITY}}$ is invariant under $G$. (Because moving points around can’t change their number!) Now suppose that

$$\psi_{\text{PARITY}} = \Big\lceil \sum_i \alpha_i \varphi_i > 0 \Big\rceil,$$

where the $\varphi_i$ are the masks with $|S(\varphi_i)| \leq K$. The group-invariance theorem tells us that we can choose the coefficients so that they depend only on the equivalence classes defined by $\equiv_G$. But then $\alpha_i$ depends only on $|S(\varphi_i)|$. To see this observe (1) all masks with the same support are identical, and (2) all sets of the same size can be transformed into one another by elements of $G$:

$$\varphi_i \equiv_G \varphi_j \iff |S(\varphi_i)| = |S(\varphi_j)|.$$

Thus $\psi_{\text{PARITY}}$ can be written, using Corollary 2 of §2.3, as

$$\Big\lceil \sum_{j=0}^{K} \alpha_j \sum_{\varphi \in \Phi_j} \varphi > 0 \Big\rceil = \Big\lceil \sum_{j=0}^{K} \alpha_j N_j(X) > 0 \Big\rceil,$$

[57] where $\Phi_j$ is the set of masks whose supports contain exactly $j$ elements. We now calculate, for an arbitrary subset $X$ of $R$,

$$N_j(X) = \sum_{\varphi \in \Phi_j} \varphi(X).$$

Since $\varphi(X)$ is 1 if $S(\varphi) \subset X$ and 0 otherwise, $N_j(X)$ is the number of subsets of $X$ with $j$ elements, that is,

$$\binom{|X|}{j} = \frac{|X|\,(|X| - 1) \cdots (|X| - j + 1)}{j!},$$

which is a polynomial of degree $j$ in $|X|$. It follows that

$$\sum_{j=0}^{K} \alpha_j N_j(X)$$

is a polynomial of degree $\leq K$ in $|X|$, say $P(|X|)$. Now consider any sequence of sets $X_0, X_1, \ldots, X_{|R|}$ such that $|X_i| = i$. Since $P(|X|) > 0$ if and only if $|X|$ is odd,

$$P(|X_0|) \leq 0, \quad P(|X_1|) > 0, \quad P(|X_2|) \leq 0, \ldots,$$

that is, $P(|X|)$ crosses the axis $|R|$ times as $|X|$ increases from 0 to $|R|$.

But $P$ is a polynomial of degree $K$. It follows (see Figure 3.1) that $K \geq |R|$. Q.E.D.

[Figure 3.1: A polynomial that changes direction $K - 1$ times must have degree at least $K$.]

[58] From this we obtain the

Theorem 3.1.2: If $\psi_{\text{PARITY}} \in L(\Phi)$ and if $\Phi$ contains only masks, then $\Phi$ contains every mask.

proof: Imagine that, even if $\Phi$ contains only masks, and the mask whose support is $A$ does not belong to $\Phi$, it were possible to write

$$\psi_{\text{PARITY}} = \Big\lceil \sum_{\varphi \in \Phi} \alpha_\varphi \varphi > 0 \Big\rceil.$$

Now define, for any $\psi$, the predicate $\psi^A(X)$ to be $\psi(X \cap A)$. Then $\psi^A_{\text{PARITY}}$ is the parity function for subsets of $A$, and has order $|A|$ by the previous theorem. To study its representation as a linear combination of masks of subsets of $A$ we consider $\varphi^A$ for $\varphi \in \Phi$. If $S(\varphi) \subset A$, clearly $\varphi^A = \varphi$; otherwise $\varphi^A$ is identically zero since

$$S(\varphi) \not\subset A \;\Rightarrow\; S(\varphi) \not\subset X \cap A \;\Rightarrow\; \varphi(X \cap A) = 0 \;\Rightarrow\; \varphi^A(X) = 0.$$

Thus, either $S(\varphi^A)$ is a proper subset of $A$, or $\varphi^A$ is identically zero. Now let $\Phi^A$ be the set of masks in $\Phi$ whose supports are subsets of $A$. Then

$$\psi^A_{\text{PARITY}} = \Big\lceil \sum_{\varphi \in \Phi^A} \alpha_\varphi \varphi > 0 \Big\rceil.$$

And for all $\varphi \in \Phi^A$, $|S(\varphi)| < |A|$. But this is in contradiction with Theorem 3.1.1 since it implies that the order of $\psi^A_{\text{PARITY}}$ is less than $|A|$. Thus the hypothesis is impossible and the theorem proven.

Corollary 1: If $\psi_{\text{PARITY}} \in L(\Phi)$ then $\Phi$ must contain at least one $\varphi$ for which $|S(\varphi)| = |R|$.

The following theorem, also immediate from the above, is of interest to students of threshold logic:

Corollary 2: Let $\Phi$ be the set of all $\psi^A_{\text{PARITY}}$ for proper subsets $A$ of $R$. Then $\psi_{\text{PARITY}} \notin L(\Phi)$.

[59] [Figure: Group-invariant coefficients for the $|R| = 3$ parity predicate.]

The further analysis of $\psi_{\text{PARITY}}$ in Chapter 10 shows that functions which might be recognizable, in principle, by a very large perceptron, might not actually be realizable in practice because of impossibly huge coefficients.
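A concrete group-invariant representation of parity makes both points visible: the need for the full-support mask and the rapid growth of the coefficients. The sketch below uses the coefficient $(-2)^{j-1}$ for every mask whose support has $j$ points. This choice rests on the binomial identity $\sum_{j \geq 1} (-2)^{j-1} \binom{n}{j} = \tfrac{1}{2}\,(1 - (-1)^n)$ and is an assumption made here for illustration; it is not claimed to be the set of coefficients shown in the book's figure. Function names are likewise illustrative.

```python
from itertools import combinations
from math import comb

def parity(X):
    """The parity predicate: true when X has an odd number of points."""
    return len(X) % 2 == 1

def alpha(j):
    """Coefficient assigned to every mask whose support has j points."""
    return (-2) ** (j - 1)

def mask_sum(X, max_degree):
    """Sum the coefficients of all nonempty masks of degree <= max_degree
    that are satisfied by X, by explicit enumeration of supports inside X."""
    total = 0
    for j in range(1, min(len(X), max_degree) + 1):
        for _support in combinations(sorted(X), j):
            total += alpha(j)
    return total

def polynomial_form(n, max_degree):
    """The same sum written as a polynomial in n = |X|, using N_j(X) = C(n, j)."""
    return sum(alpha(j) * comb(n, j) for j in range(1, max_degree + 1))

R = (0, 1, 2, 3)
all_X = [frozenset(c) for k in range(len(R) + 1) for c in combinations(R, k)]

# With masks up to the full degree |R| the threshold sum reproduces parity.
full_ok = all((mask_sum(X, len(R)) > 0) == parity(X) for X in all_X)

# Dropping the single mask of degree |R| breaks this particular representation.
truncated_ok = all((mask_sum(X, len(R) - 1) > 0) == parity(X) for X in all_X)
```

With these coefficients the sum equals 1 for odd $|X|$ and 0 for even $|X|$, and the ratio of the largest to the smallest coefficient magnitude is $2^{|R|-1}$. The failure of the truncated sum only shows that this one choice of coefficients stops working; Theorem 3.1.1 is what rules out every other choice.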
For example, it will be shown that in any representation of $\psi_{\text{PARITY}}$ as linear in the set of masks the ratio of the largest to the smallest coefficients must be $2^{|R|-1}$.

3.2* The “One-in-a-box” Theorem

Another predicate of great interest is associated with the geometric property of “connectedness.” Its application and interpretation is deferred to Chapter 5; the basic theorem is proved now.

Theorem 3.2: Let $A_1, \ldots, A_m$ be disjoint subsets of $R$ and define the predicate

$$\psi(X) = \lceil |X \cap A_i| > 0 \text{ for every } A_i \rceil,$$

that is, there is at least one point of $X$ in each $A_i$. If for all $i$, $|A_i| = 4m^2$, then the order of $\psi$ is $\geq m$.

*This theorem is used to prove the theorem in §5.1. Because §5.7 gives an independent proof (using Theorem 3.1.1), this section can be skipped on first reading.

[60] Corollary: If $R = A_1 \cup A_2 \cup \cdots \cup A_m$, the order of $\psi$ is at least $\big(\tfrac{1}{4}|R|\big)^{1/3}$.

proof: For each $i = 1, \ldots, m$ let $G_i$ be the group of permutations of $R$ which permute the elements of $A_i$ but do not affect the elements of the complement of $A_i$. Let $G$ be the group generated by all elements of the $G_i$. Clearly $\psi$ is invariant with respect to $G$. Let $\Phi$ be the set of masks of degree $k$ or less. To determine the equivalence class of any $\varphi \in \Phi$ consider the “occupancy numbers” $|S(\varphi) \cap A_i|$. Note that $\varphi_1 \equiv_G \varphi_2$ if and only if $|S(\varphi_1) \cap A_i| = |S(\varphi_2) \cap A_i|$ for each $i$. Let $\Phi_1, \Phi_2, \ldots$ be the equivalence classes.

Now consider an arbitrary set $X$ and an equivalence class $\Phi_j$. We wish to calculate the number $N_j(X)$ of members of $\Phi_j$ satisfied by $X$, that is,

$$N_j(X) = |\{\varphi \mid \varphi \in \Phi_j \text{ AND } S(\varphi) \subset X\}|.$$

A simple combinatorial argument shows that

$$N_j(X) = \binom{|X \cap A_1|}{|S(\varphi) \cap A_1|} \binom{|X \cap A_2|}{|S(\varphi) \cap A_2|} \cdots \binom{|X \cap A_m|}{|S(\varphi) \cap A_m|},$$

where

$$\binom{y}{n} = \frac{y(y-1)\cdots(y-n+1)}{n!}$$

and $\varphi$ is an arbitrary member of $\Phi_j$.
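The product formula for $N_j(X)$ can be compared with a direct count on a small case. The sketch below is illustrative only: the two boxes, the figure `X`, and the function names are assumptions made here, and the boxes are taken to cover the whole retina (the situation of the Corollary), so that a mask's equivalence class is fixed by its occupancy numbers alone.

```python
from itertools import combinations
from math import comb, prod

def occupancy(support, boxes):
    """Occupancy numbers |S(phi) ∩ A_i| of a mask support."""
    return tuple(len(support & box) for box in boxes)

def count_by_enumeration(X, boxes, occ):
    """Count the supports with occupancy numbers `occ` that lie inside X,
    by listing every candidate support of the right size."""
    retina = sorted(set().union(*boxes))
    size = sum(occ)
    return sum(
        1
        for chosen in combinations(retina, size)
        if occupancy(frozenset(chosen), boxes) == occ and frozenset(chosen) <= X
    )

def count_by_formula(X, boxes, occ):
    """The product of binomial coefficients from the text."""
    return prod(comb(len(X & box), n) for box, n in zip(boxes, occ))

boxes = [frozenset({0, 1, 2}), frozenset({3, 4, 5})]
X = frozenset({0, 1, 3, 4, 5})
occ = (1, 2)  # one point of the support in the first box, two in the second
```

For this `X` there are two ways to pick the point in the first box and three ways to pick the pair in the second, so both counts come to 6.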
Since the numbers $|S(\varphi) \cap A_i|$ depend only on the classes $\Phi_j$ and add up to not more than $k$, it follows that $N_j(X)$ can be written as a polynomial of degree $k$ or less in the numbers $x_i = |X \cap A_i|$:

$$N_j(X) = P_j(x_1, \ldots, x_m).$$

[61] Now let $\lceil \sum \alpha_\varphi \varphi > 0 \rceil$ be a representation of $\psi$ as a linear threshold function in the set of masks of degree less than or equal to $k$. By the argument which we have already used several times we can assume that $\alpha_\varphi$ depends only on the equivalence class of $\varphi$ and write

$$\sum_{\varphi} \alpha_\varphi \varphi(X) = \sum_j \beta_j \sum_{\varphi \in \Phi_j} \varphi(X) = \sum_j \beta_j N_j(X) = \sum_j \beta_j P_j(x_1, \ldots, x_m),$$

which, as a sum of polynomials of degree at most $k$, is itself such a polynomial. Thus we can conclude that there exists a polynomial of degree at most $k$, $Q(x_1, \ldots, x_m)$, with the property that

$$\psi(X) = \lceil Q(x_1, \ldots, x_m) > 0 \rceil \qquad (x_i =_{\text{df}} |X \cap A_i|),$$

that is, that if all $x_i$ lie in the range $0 \leq x_i \leq 4m^2$, then

$$Q(x_1, \ldots, x_m) > 0 \iff x_i > 0 \text{ for all } i.$$

In $Q(x_1, \ldots, x_m)$ make the formal substitution

$$x_i = [t - (2i - 1)]^2.$$

Then $Q(x_1, \ldots, x_m)$ becomes a polynomial of degree at most $2k$ in $t$. Now let $t$ take on the values $t = 0, 1, \ldots, 2m$. Then $x_i = 0$ for some $i$ if $t$ is odd (in fact, for $i = \tfrac{1}{2}(t + 1)$); but $x_i > 0$ for all $i$ if $t$ is even. Hence, by the definition of the $\psi$ predicate, $Q$ must be positive for even $t$ and negative or zero for odd $t$. By counting the number of changes of sign it is clear that $2k \geq 2m$, that is, $k \geq m$. This completes the proof.

4 The “And/Or” Theorem

4.0 In this chapter we prove the “And/Or” theorem stated in §1.5.

Theorem 4.0: There exist predicates $\psi_1$ and $\psi_2$ of order 1 such that $\psi_1 \wedge \psi_2$ and $\psi_1 \vee \psi_2$ are not of finite order.

We prove the assertion for $\psi_1 \wedge \psi_2$. The other half can be proved in exactly the same way. The techniques used in proving this theorem will not be used in the sequel and so the rest of the chapter can be omitted by readers who don’t know, or who dislike, the following kind of algebra.
4.1 Lemmas

We have already remarked in §1.5 that if $R = A \cup B \cup C$ the predicate

$$\lceil |X \cap A| > |X \cap C| \rceil$$

is of order 1, and stated without proof that if $A$, $B$, and $C$ are disjoint (see Figure 4.1), then

$$\lceil (|X \cap A| > |X \cap C|) \wedge (|X \cap B| > |X \cap C|) \rceil$$

is not of bounded order as $|R|$ becomes large. We shall now prove this assertion. We can assume without any loss of generality that the three parts of $R$ have the same size $M = |A| = |B| = |C|$, and that $|R| = 3M$. We will consider predicates of the stated form for different-size retinas. We will prove that

If $\psi_M(X)$ is the predicate of the stated form for $|R| = 3M$, then the order of $\psi_M$ increases without bound as $M \to \infty$.

The proof follows the pattern of proofs in Chapter 3. We shall assume that the order of $\{\psi_M\}$ is bounded by a fixed integer $N$ [63] for all $M$, and derive a contradiction by showing that the associated polynomials would have to satisfy inconsistent conditions.

The first step is to set up the associated polynomials for a fixed $M$. We do this by choosing the group of permutations that leaves the sets $A$, $B$, and $C$ fixed but allows arbitrary permutations within the sets. The equivalence class of a mask $\varphi$ is then characterized by the three numbers $|A \cap S(\varphi)|$, $|B \cap S(\varphi)|$, and $|C \cap S(\varphi)|$. For any given mask $\varphi$ and set $X$ the number of masks equivalent to $\varphi$ and satisfied by $X$ is

$$N_\varphi(X) = \binom{|A \cap X|}{|A \cap S(\varphi)|} \times \binom{|B \cap X|}{|B \cap S(\varphi)|} \times \binom{|C \cap X|}{|C \cap S(\varphi)|}.$$

Since we are assuming $|S(\varphi)| \leq N$, we can be sure that $N_\varphi(X)$ is a polynomial of degree at most $N$ in the three numbers

$$x = |A \cap X|, \quad y = |B \cap X|, \quad z = |C \cap X|.$$

Let $\Phi$ be the set of masks with $|\text{support}| \leq N$. Enumerate the equivalence classes of $\Phi$ and let $N_i(X)$ be the number of masks of the $i$th class satisfied by $X$. The group-invariance theorem allows us to write

$$\psi_M(X) = \Big\lceil \sum_i \beta_i N_i(X) > 0 \Big\rceil.$$

The sum $\sum_i \beta_i N_i(X)$ is a polynomial of degree at most $N$ in $x, y, z$. Call it $P_M(x, y, z)$.

Now, by definition, for those values of $x, y, z$ which are possible occupancy numbers, that is, nonnegative integers $\leq M$,

$$P_M(x, y, z) > 0 \text{ if and only if } x > z \text{ and } y > z.$$

We shall show, through a series of lemmas, that this cannot be true for all $M$.

Lemma 1: Let $P_1(x, y, z), P_2(x, y, z), \ldots$ be an infinite sequence of nonzero polynomials of degree at most $N$, with the property that for all positive integers $x, y, z$ less than $M$

SEPARATION CONDITIONS:
$x > z$ and $y > z$ implies $P_M(x, y, z) \geq 0$,
$x < z$ or $y < z$ implies $P_M(x, y, z) \leq 0$.

[64] Then there exists a single nonzero polynomial $P(x, y, z)$ of degree at most $N$ with the property that the separation conditions, with $P$ in the place of $P_M$, hold for ALL positive integral values of $x, y, z$.

It should be observed that we have had to weaken the separation conditions by allowing equality in both conditions since inequality would not be preserved in the limit. Consequences of this will make themselves felt in the proof of Lemma 2.

proof: Write

$$P_M(x, y, z) = \sum_{i=1}^{T} c_{M,i}\, m_i(x, y, z),$$

where $m_1, m_2, \ldots, m_T$ is an enumeration of the monomials of degrees $\leq N$ in $x, y, z$. Since the conditions on $P_M$ are preserved under multiplication by a positive scaling factor, we can assume that

$$\sum_{i=1}^{T} c_{M,i}^2 = 1.$$

Now consider the set of points in $T$-dimensional space

$$C_M = (c_{M,1}, c_{M,2}, \ldots, c_{M,T}), \qquad M = 1, 2, \ldots.$$

These all lie in a compact* set, the surface of the unit $T$-dimensional sphere. There is, therefore, a subsequence $C_{M_j}$ which converges to a limit

$$C_{M_j} \to (c_1, c_2, \ldots, c_T)$$

in the sense that, for each $i$, $\lim_{j} c_{M_j, i} = c_i$. The polynomial

$$P(x, y, z) = \sum_{i=1}^{T} c_i\, m_i(x, y, z)$$

[65] inherits the separation conditions for all positive integral values of $x, y, z$. That it is not identically zero follows from the fact that the $c_i$ inherit the condition $\sum c_i^2 = 1$.

*See index.
In order to prove our main theorem, we first establish a corresponding result for polynomials in two variables, and later (Lemma 3) adapt it to $P(x, y, z)$.

Lemma 2: If a polynomial $f(\alpha, \beta)$ satisfies the following conditions for all integral values of $\alpha$ and $\beta$, then it is identically zero:

$\alpha > 0$ and $\beta > 0$ implies $f(\alpha, \beta) \geq 0$,
$\alpha < 0$ or $\beta < 0$ implies $f(\alpha, \beta) \leq 0$.

proof: Suppose that $f(\alpha, \beta)$ could satisfy these conditions yet not be identically zero. Then we could write it in the form

$$f(\alpha, \beta) = \beta^N g(\alpha) + r(\alpha, \beta)$$

with $g(\alpha)$ not identically zero and with $r(\alpha, \beta)$ of degree lower than $N$ in $\beta$. We can then find a number $\alpha_0 > 0$ such that neither of $g(\pm\alpha_0)$ is zero, and then we can choose a number $\beta_0$ so large that all four of the inequalities

$$|\beta_0^N g(\pm\alpha_0)| > |r(\pm\alpha_0, \pm\beta_0)|$$

are satisfied, so that $r(\pm\alpha_0, \pm\beta_0)$ cannot affect the sign of $f(\pm\alpha_0, \pm\beta_0)$. Then, since $f(-\alpha_0, \beta_0) \leq 0$ we have $g(-\alpha_0) < 0$; hence $(-\beta_0)^N g(-\alpha_0) > 0$; hence $f(-\alpha_0, -\beta_0) > 0$; which contradicts the conditions, and hence proves the lemma.

[66] 4.2 A Digression on Bezout’s Theorem

Readers familiar with elementary algebraic geometry will observe that the lemma would follow immediately from Bezout’s theorem if the conditions could be stated for all real values of $\alpha$ and $\beta$. We would then merely have to prove that the doubly infinite L of Figure 4.2 is not an algebraic curve.

[Figure 4.2]

Bezout’s theorem tells us that if the intersection of an algebraic curve $L$ with an irreducible algebraic curve $Y$ contains an infinite number of points, it must contain the whole of $Y$. But the L contains the positive half of the $y$ axis. Straight lines are irreducible, so L would have to contain the entire $y$ axis if it were algebraic.

Unfortunately, because our conditions hold only on integer lattice-points, we must allow for the possibility that $f(\alpha, \beta) = 0$ takes a more contorted form as, for example, in Figure 4.3.

Part of the pathological behavior of this curve is irrelevant. Since a polynomial of degree $N$ can cut a straight line only $N$ times, the incursions into the interiors of the quadrants can be confined to a bounded region. This means that the curve $f(\alpha, \beta) = 0$ must “asymptotically occupy” the parts of the “channel” illustrated in Figure 4.4.

[68] It seems plausible that a generalization of Bezout’s theorem could be formulated to deduce from this that the curve must enter the negative halves in a sense that would furnish an immediate and more illuminating proof of our lemma. We have not pursued this conjecture.

Lemma 3: If a polynomial $P(x, y, z)$ satisfies the following conditions for all positive integral values of $x$, $y$, and $z$, then it is identically zero:

$x > z$ and $y > z$ implies $P(x, y, z) \geq 0$,
$x < z$ or $y < z$ implies $P(x, y, z) \leq 0$.

proof: Suppose that $P(x, y, z)$ had these properties, but were not identically zero. Define

$$Q(\alpha, \beta, z) = P(z + \alpha, z + \beta, z)$$

and write

$$Q(\alpha, \beta, z) = z^M f(\alpha, \beta) + r(\alpha, \beta, z),$$

where $r$ is of degree less than $M$ in $z$, and $f(\alpha, \beta)$ is not identically zero. Then we can show that $f$ must satisfy the conditions in Lemma 2: Choose any $\alpha_0$ and $\beta_0$ for which $f(\alpha_0, \beta_0) \neq 0$. Choose a $z_0$ so large that

$$z_0 + \alpha_0 > 0, \quad z_0 + \beta_0 > 0, \quad \text{and} \quad |z_0^M f(\alpha_0, \beta_0)| > |r(\alpha_0, \beta_0, z_0)|.$$

It follows that

$$f(\alpha_0, \beta_0) > 0 \iff Q(\alpha_0, \beta_0, z_0) > 0,$$

that is, if and only if $P(z_0 + \alpha_0, z_0 + \beta_0, z_0) > 0$. Thus

$$\alpha_0 > 0 \text{ and } \beta_0 > 0 \;\Rightarrow\; z_0 + \alpha_0 > z_0 \text{ and } z_0 + \beta_0 > z_0 \;\Rightarrow\; P(z_0 + \alpha_0, z_0 + \beta_0, z_0) \geq 0 \;\Rightarrow\; f(\alpha_0, \beta_0) \geq 0,$$

and similarly,

$$\alpha_0 < 0 \text{ OR } \beta_0 < 0 \;\Rightarrow\; f(\alpha_0, \beta_0) \leq 0.$$

But this is true for all $\alpha_0, \beta_0$. Thus by Lemma 2, $f(\alpha, \beta) \equiv 0$. It follows that $Q(\alpha, \beta, z)$ is of degree zero in $z$, which is only possible if $P$ is identically zero. This concludes the proof of the And/Or Theorem.
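For concreteness, the two order-1 predicates whose conjunction has just been shown to have unbounded order can be written down explicitly. The sketch below is a small illustration with $M = 2$: it checks that each conjunct is a first-order threshold sum (weight $+1$ on one part, $-1$ on $C$), and that the conjunction depends on $X$ only through the occupancy numbers $(x, y, z)$. The retina layout and all names are assumptions made here for illustration.

```python
from itertools import combinations

M = 2
A = frozenset(range(0, M))
B = frozenset(range(M, 2 * M))
C = frozenset(range(2 * M, 3 * M))
R = sorted(A | B | C)

def psi1(X):
    """|X ∩ A| > |X ∩ C|"""
    return len(X & A) > len(X & C)

def psi2(X):
    """|X ∩ B| > |X ∩ C|"""
    return len(X & B) > len(X & C)

def psi_and(X):
    """The conjunction studied in the And/Or theorem."""
    return psi1(X) and psi2(X)

def order1_threshold(X, weights):
    """A first-order form: the sum over points p in X of weights[p], compared with 0."""
    return sum(weights.get(p, 0) for p in X) > 0

# One weight per retina point: +1 on the favored part, -1 on C, 0 elsewhere.
weights1 = {**{p: 1 for p in A}, **{p: -1 for p in C}}
weights2 = {**{p: 1 for p in B}, **{p: -1 for p in C}}

all_X = [frozenset(c) for k in range(len(R) + 1) for c in combinations(R, k)]

def occupancy(X):
    """The numbers (x, y, z) used in the proof."""
    return (len(X & A), len(X & B), len(X & C))

# Collect the values psi_and takes for each occupancy triple.
values_by_occupancy = {}
for X in all_X:
    values_by_occupancy.setdefault(occupancy(X), set()).add(psi_and(X))
```

Each conjunct agrees with its threshold form on all 64 figures, and every occupancy triple maps to a single truth value of the conjunction, which is what lets the proof pass from masks to the polynomial $P_M(x, y, z)$. Nothing here demonstrates the unbounded order itself; that is the content of the lemmas.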
GEOMETRIC THEORY OF LINEAR INEQUALITIES

[70] Introduction to Part II

The analysis of geometric properties of perceptrons begins, in Chapter 5, with the study of the predicate $\psi_{\text{CONNECTED}}$: Is the figure $X$ all connected together, in the sense that between any two points of the figure there is a continuous path that lies entirely within the figure (see §0.5)? We chose to investigate connectedness because of a belief that this predicate is nonlocal in some very deep sense; therefore it should present a serious challenge to any basically local, parallel type of computation.

Originally, we tried to prove that $\psi_{\text{CONNECTED}}$ is not of finite order by exploiting its sensitivity to small changes in $X$ (any connected figure is easily converted to a disconnected figure by making a thin cut or by adding an isolated point), but we were unable to convert this to a real proof. The successful methods were based on using the group-invariance theorem, but indirectly. We recall that in dealing with $\psi_{\text{PARITY}}$ we began by identifying the largest possible group of transformations of $R$ that leaves $\psi$ invariant; in the case of $\psi_{\text{PARITY}}$, the group of all permutations. We then used this group to coalesce the $\varphi$’s into equivalence classes, and eventually reduced the problem about representing $\psi$ in $L(\Phi)$ to a problem about polynomials in enumeration functions.

But in the case of $\psi_{\text{CONNECTED}}$, we find that any attempt to apply this technique directly leads to severe problems associated with the representation of a general topological transformation on a discrete retina. Fortunately, it was possible to “reduce” the problem to a simple one involving more tractable groups. In fact, we see in §5.1 that if a perceptron could discriminate between just certain restricted instances of connectedness, then it could be made to simulate the “one-in-a-box” predicate of §3.2. If this were possible, we would have, logically:

$$\psi_{\text{CONNECTED}} \text{ is finite order} \;\Rightarrow\; \psi_{\text{CONNECTED}} \,|\, \text{restricted is finite order} \;\Rightarrow\; \psi_{\text{ONE-IN-A-BOX}} \text{ is finite order},$$

and since the last is false, so is the first.

Toward the end of Chapter 5, this firmly negative result (that $\psi_{\text{CONNECTED}}$ is not of finite order) is generalized to show that the same is true of all topological predicates, with one single type of exception. Only the Euler number, the lowest and simplest of all [71] the topological invariants, can be recognized by the finite-order predicate-scheme.

[Handwritten marginal note by the authors, not legible in this copy.]

In Chapter 6 we obtain a series of positive results. There are a variety of geometric properties, in addition to $\psi_{\text{CONVEX}}$ and $\psi_{\text{CIRCLE}}$ mentioned in §0.5, that are quite clearly of finite (and in fact of rather low) order. These include particular forms like triangles or squares or letters of the alphabet. From some of these a type of description emerges that we call “geometric spectra.” These can be regarded either as local geometric properties or as simple statistical qualities of the patterns. The fact that perceptrons can recognize certain patterns related to these spectra is probably responsible for some of the false optimism about the capabilities of perceptrons in general. At the end of Chapter 6 we see that while these patterns can be identified in isolation, the perceptron cannot detect them in more complicated contexts.

Chapter 7 is a curious detour. It turns out that certain predicates that do not seem at first to have finite order, such as symmetries, or similarities between pairs of figures, can in fact be realized by finite-order predicate-schemes. But the realizations have a peculiar unreality, for their coefficients grow at such astronomical rates as to be physically meaningless. The incident seems to have an important moral: even within a simple combinatorial subject such as this, one must be on guard for nonobvious codes or representations of things. The linear forms obtained by the “stratification” method of Chapter 7 have a quality somewhat like the Gödel numbers of logic, or the “nonstandard models” of mathematical analysis. Our intuition is still weak in the field of computation, and there are surely many more surprises to come.

We study the diameter-limited perceptron in Chapter 8. Here the situation is much simpler, and one does not even need the algebraic theory to obtain generally negative results. For the most part, it turns out that the diameter-limited machines are subject to limitations similar to those of the order-1 machines. In certain respects they seem different: for example, in their ability to approximate certain integral-like computations. This makes it possible for them to recognize $\psi_{\text{CIRCLE}}$ within some accuracy limitations. And they can compute a narrowly limited class of predicates related to the Euler number.

The predicate $\psi_{\text{CONNECTED}}$ seemed so important in this study that we felt it appropriate to try to relate the perceptron’s performance [72] to that of some other, fundamentally different, computation schemes. In Chapter 9 we study it in the context of a wide variety of models for geometric computation. We were surprised to find that, for serial computers, only a very small amount of memory was required. One might have supposed that something like a “push-down list” would be needed so that the machine could retrace its steps in the course of exploring the maze of possible paths through a figure.

Representing Geometrical Patterns

We are about to study a number of interesting geometrical predicates.
But as a first step, we have to provide the underlying space $R$ with the topological and metric properties necessary for defining geometrical figures; this was not necessary in the case of predicates like parity and others related to counting, for these were not really geometric in character.

The simplest procedure that is rigorous enough yet not too mathematically fussy seems to be to divide the Euclidean plane, $E^2$, into squares, as an infinite chess board. The set $R$ is then taken as the set of squares. A figure $X_E$ of $E^2$ is then identified with the squares that contain at least one point of $X_E$. Thus to any subset $X_E$ of $E^2$ corresponds the subset $X$ of $R$ defined by

$x \in X$ if at least one point of $X_E$ lies in the square $x$.

Now, although $X$ and $X_E$ are logically distinct, no serious confusion can arise if we identify them, and we shall do so from now on. Thus we refer to certain subsets of $R$ as “circles,” “triangles,” etc., meaning that they can be obtained from real circles and triangles by the map $X_E \to X$. Of course, this means that near the “limits of resolution” one begins to obtain apparent errors of classification because of the finite “mesh” of $R$. Thus a small circle will not look very round.

If it were necessary to distinguish between $E^2$ and $R$ we would say that two figures $X_E$, $X'_E$ of $E^2$ are in the same $R$-tolerance class if $X = X'$. There is no problem with the translation groups that play the main roles in Chapters 6, 7, and 8. There is a serious problem of handling the tolerances when discussing, as in §7.6, dilations or rotations. Curiously, the problem does not seem to arise in discussing general topological equivalence, in Chapter 5, because we can prove all the theorems we know by using less than the full group of topological transformations.

5 $\psi_{\text{CONNECTED}}$: A Geometric Property with Unbounded Order

5.0 Introduction

In this chapter we begin the study of connectedness.
A figure $X$ is connected if it is not composed of two or more separate, nontouching, parts. While it is interesting in itself, we chose to study the connectedness property especially because we hoped it would shed light on the more basic, though ill-defined, question of local vs. global property. For connectedness is surely global. One can never conclude that a figure is connected from isolated local experiments. To be sure, in the case of a figure like

[inline figure]

one would discover, by looking locally at the neighborhood of the isolated point in the lower right corner, that the figure is not connected. But one could not conclude that a figure is connected from the absence of any such local evidence of disconnectivity. If we ask which one of these two figures is connected

[Figure 5.1]

it is difficult to imagine any local event that could bias a decision toward one conclusion or the other.

Now, this is easy to prove, for example, in the narrow framework of the diameter-limited concept of local (see §0.3 and Chapter 8). It is harder to establish for the order-limited framework. But the diameter-limited case gives us a hint: by considering a particular subclass of figures we might be able to show that the problem is equivalent to that of recognizing a parity, or something like it, and this is what we in fact shall do.

5.1* The Connectedness Theorem

Two points of $R$ are adjacent if they are squares with a common edge.† A figure is connected if, given any two points (that is, “squares”) $p_1$, $p_2$ of the figure, we can find a path through adjacent points from $p_1$ to $p_2$.

Theorem 5.1: The predicate $\psi_{\mathrm{CONNECTED}}(X) = \lceil X \text{ is connected} \rceil$ is not of finite order (§1.6), that is, it has arbitrarily large orders as $|R|$ grows in size.

Proof: Suppose that $\psi_{\mathrm{CONNECTED}}(X)$ could have order $< m$. Consider an array of squares of $R$ arranged in $2m + 1$ rows of $4m^2$ squares each (Figure 5.2).
Let $Y_0$ be the set of points shaded in the diagram, that is, the array of points in odd-numbered rows, and let $Y_1$ be the remaining squares of the array. Let $F$ be the family of figures obtained from the figure $Y_0$ by adding subsets of $Y_1$; that is, $X \in F$ if it is of the form $Y_0 \cup X_1$, where $X_1 \subset Y_1$.

[Figure 5.2]

*We will give two other proofs from different points of view. The proof in §5.5 is probably the easiest to understand by itself, but the proof in §5.7 gives more information about the way the order grows with the size of $R$.

†We can’t allow corner contact, as in [inline figure], to be considered as connection. For this would allow two “curves” to cross without “intersecting” and not even the Jordan curve theorem would be true. The problem could be avoided by dividing $E^2$ into hexagons instead of squares!

Now $X$ will be connected if and only if $X_1$ contains at least one square from each even row; that is, if the set $X_1$ satisfies the “one-in-a-box” condition of §3.2.

To see the details of how the one-in-a-box theorem is applied, if it is not already clear, consider the figures of family $F$ as a subset of all possible figures on $R$. Clearly, if we had an order-$k$ predicate $\psi_{\mathrm{CONNECTED}}$ that could recognize connectivity on $R$, we could have one that works on $F$; namely, the same predicate with constant zero values assigned to all variables not in $Y_0 \cup Y_1$. And since all points of the odd rows always have value 1 for figures in $F$, this in turn means that we could have an order-$k$ predicate to decide the one-in-a-box property on the set $Y_1$; namely, the same predicate further restricted to having constant unity values at the points in $Y_0$. Thus each Boolean function of the original predicate $\psi_{\mathrm{CONNECTED}}$ is replaced by the function obtained by fixing certain of its variables to zero and to one; this operation can never increase the order of a function.
But since this last predicate cannot exist, neither can the original $\psi_{\mathrm{CONNECTED}}$.

This proof shows that $\psi_{\mathrm{CONNECTED}}$ has order at least $C|R|^{1/3}$. In §5.7 we show it is at least $C|R|^{1/2}$.

5.2 An Example

Consider the special case for $k = 2$, and the equivalent one-in-a-box problem for a space of the form

[inline figure]

in which $m = 3$ and there are just 4 squares in each box. Now consider a $\psi$ of degree 2; we will show that it cannot characterize the connectedness of pictures of this kind. Suppose that $\psi = \lceil \sum \alpha_i \varphi_i > 0 \rceil$ and consider the equivalent form, symmetrized under the full group of permutations that interchange the rows and permute within rows.* Then there are just three equivalence-classes of masks of degree $\le 2$, namely:

Single points: $\varphi_i^{1} = x_i$,
Point-pairs: $\varphi_{ij}^{11} = x_i x_j$ ($x_i$ and $x_j$ in same row),
Point-pairs: $\varphi_{ij}^{12} = x_i x_j$ ($x_i$ and $x_j$ in different rows).

Hence any order-2 predicate must have the form

$\psi = \lceil \alpha_1 N_1(X) + \alpha_{11} N_{11}(X) + \alpha_{12} N_{12}(X) > 0 \rceil$

where $N_1$, $N_{11}$, and $N_{12}$ are the numbers of point sets of the respective types in the figure $X$. Now consider the two figures:

[inline figures $X_1$ and $X_2$, with $\psi_{\mathrm{CONNECTED}}(X_1) = 1$ and $\psi_{\mathrm{CONNECTED}}(X_2) = 0$]

In each case one counts $N_1 = 6$, $N_{11} = 6$, $N_{12} = 9$; hence $\psi$ has the same value for both figures. But $X_1$ is connected while $X_2$ is not! Note that here $m = 3$, so that we obtain a contradiction with $|A_i| = 4$, while the general proof required $|A_i| = 4m^2 = 36$. The same is true with $|A_i| = 3$, $m = 4$, because the row-occupancy patterns (3, 1, 1, 1) and (2, 2, 2, 0) yield the same three counts. It is also known that if $m = 6$, we can get a similar result with $|A_i| = 16$. This was shown by Dona Strauss. The case of $m = 3$, $|A_i| = 3$ is of order 2, since one can write

$\psi_{\mathrm{CONNECTED}} = \lceil 3N_1(X) - 2N_{11}(X) > 8 \rceil$.

*Note that this is not the same group used in proving Theorem §3.2. There we did not use the row-interchange part of the group.
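The counting argument of §5.2 can be checked numerically. The Python sketch below is an illustration added here, not something from the original text. The function names (`spectrum`, `one_in_a_box`, `order2_form`) and the choice to describe a figure only by how many squares it occupies in each box are assumptions. The occupancy patterns (4, 1, 1) and (3, 3, 0) are one pair consistent with the stated counts $N_1 = 6$, $N_{11} = 6$, $N_{12} = 9$; the figures actually drawn in the book may differ.

```python
from itertools import product
from math import comb


def spectrum(row_counts):
    """Return (N1, N11, N12) for a figure given by its occupancy count per box."""
    n1 = sum(row_counts)                        # single points
    n11 = sum(comb(c, 2) for c in row_counts)   # pairs within the same box
    n12 = comb(n1, 2) - n11                     # pairs in different boxes
    return n1, n11, n12


def one_in_a_box(row_counts):
    """True when every box contains at least one point of the figure."""
    return all(c > 0 for c in row_counts)


def order2_form(row_counts):
    """The order-2 form quoted for m = 3, |A_i| = 3: 3*N1 - 2*N11 > 8."""
    n1, n11, _ = spectrum(row_counts)
    return 3 * n1 - 2 * n11 > 8


# m = 3 boxes of 4 squares: identical spectra, different answers.
print(spectrum((4, 1, 1)), one_in_a_box((4, 1, 1)))
print(spectrum((3, 3, 0)), one_in_a_box((3, 3, 0)))

# m = 4 boxes of 3 squares: the patterns named in the text.
print(spectrum((3, 1, 1, 1)), spectrum((2, 2, 2, 0)))

# m = 3 boxes of 3 squares: the order-2 form agrees with one-in-a-box everywhere.
print(all(order2_form(rc) == one_in_a_box(rc)
          for rc in product(range(4), repeat=3)))
```

Because the symmetrized predicate depends only on the three counts, any pair of occupancy patterns with equal spectra but different one-in-a-box values rules out an order-2 realization for that box size.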
The proof method used in these examples is an instance of the use of what we call the “geometric $n$-tuple spectrum,” and the general principle is further developed in Chapter 6.

5.3 Slice-wise Connectivity

It should be observed that the proof in §5.1 applies not only to the property of connectivity in its classical sense but to the stronger predicate defined by:

$\lceil$There is a straight line $L$ such that $X$ does not intersect $L$ and does not lie entirely to one side of $L\rceil$.

The general definition of connectedness would have “curve” for $L$ instead of “straight line,” and one would expect that this would require a higher order for its realization.

5.4 Reduction of One Perceptron to Another

In proving that $\psi_{\mathrm{CONNECTED}}$ is not of finite order, our approach was first to prove this about a different (and simpler) predicate, $\psi_{\mathrm{ONE\text{-}IN\text{-}A\text{-}BOX}}$. Then we showed that $\psi_{\mathrm{CONNECTED}}$ could be used, on a certain subset of figures, to compute $\psi_{\mathrm{ONE\text{-}IN\text{-}A\text{-}BOX}}$; therefore its order must be at least as high. There are, of course, many other figures that $\psi_{\mathrm{CONNECTED}}$ will have to classify (in addition to those that contain all points of $Y_0$ in §5.1), but it was sufficient to study the predicate’s behavior just on this subclass of figures.

We will use this idea again, many times, but the situation will be slightly more complicated. In the case just discussed, both predicates were defined on figures in the same retina, but in the sequel we will often want to establish a relation between two predicates defined on different spaces. The flexibility to do this is established by the following simple theorem.

5.4.1 The Collapsing Theorem

This theorem will enable us to deduce limits on the order of a predicate $\psi$ on a set $R$ from information about the order of a related predicate $\tilde\psi$ on a set $\tilde R$.

[Handwritten note and diagram by the authors relating $\tilde R$, $R$, $F$, PARITY, and CONNECTED; not legible in this copy.]
Let $F$ be a function that associates with any figure $\tilde X$ in $\tilde R$ a figure $X = F(\tilde X)$ in $R$. Now let $\psi$ be any predicate on $R$. This predicate defines a predicate $\tilde\psi$ on $\tilde R$ by the computation

$\tilde\psi(\tilde X) = \psi(F(\tilde X)) = \psi(X)$.

Theorem 5.4.1: The order of $\psi$ is $\ge$ the order of $\tilde\psi$, provided that each point of $R$ depends upon at most one point of $\tilde R$, in the sense that for each point $x$ of $R$, either it has a constant value ($x \in X$ for all $\tilde X$, or $x \notin X$ for all $\tilde X$), or else there is a point $\tilde x$ of $\tilde R$ such that either $\lceil x \in X \rceil = \lceil \tilde x \in \tilde X \rceil$ for all $\tilde X$, or $\lceil x \in X \rceil = \lceil \tilde x \notin \tilde X \rceil$ for all $\tilde X$.

Proof: Suppose that $\psi$ has a realization of order $K$: $\lceil \sum \alpha_i \varphi_i > 0 \rceil$. Then $\tilde\psi$ has a realization $\lceil \sum \alpha_i \tilde\varphi_i > 0 \rceil$, where $\tilde\varphi_i(\tilde X)$ is $\varphi_i(F(\tilde X))$. To see that $|S(\tilde\varphi_i)| \le K$, recall that $\varphi_i$ depends on at most $K$ points of $R$, and these in turn depend on at most $K$ points of $\tilde R$. So $\tilde\varphi_i(\tilde X) = \varphi_i(F(\tilde X))$ depends on at most $K$ points of $\tilde R$.

Example: A typical application of this construction is illustrated as follows (see Figure 5.3). The set $\tilde R$ has three points $(x_1, x_2, x_3)$.

[Figure 5.3]

The set $R$ has 45 points. In the diagram, these fall into three classes: 8 points shown as white, 25 points shown as black, and 12 points labeled $x_i$ or $\bar x_i$. $F$ is defined in the following way: given a set $\tilde X$ in $\tilde R$, $F(\tilde X)$ must contain all the black squares, no white squares, the squares labeled $x_i$ only if $x_i \in \tilde X$, and the squares labeled $\bar x_i$ only if $x_i \notin \tilde X$.

5.5 Huffman’s Construction for $\psi_{\mathrm{CONNECTED}}$

We shall illustrate the application of the preceding concept by giving an alternative proof that $\psi_{\mathrm{CONNECTED}}$ has no finite order, based on a construction suggested to us by D. Huffman. The intuitive idea is to construct a switching network that will be connected if an odd number of its $n$ switches are in the “on” position. Thus the connectedness problem is reduced to the parity problem. Such a network is shown in Figure 5.3 for $n = 3$.
The interpretation of the symbols $x_i$ and $\bar x_i$ is as follows: when $x_i$ is in the “on” position, contact is made wherever $x_i$ appears and broken wherever $\bar x_i$ appears; when $x_i$ is in the “off” position, contact is made where $\bar x_i$ appears and broken where $x_i$ appears. It is easy to see that the whole net is connected in the electrical and topological sense if the number of switches in the “on” position is 1 or 3. The generalization to $n$ is obvious:

1. List the terms in the conjunctive normal form for $\psi_{\mathrm{PARITY}}$ considered as a point function, which in the present case can be written
$(x_1 \vee \bar x_2 \vee \bar x_3) \wedge (\bar x_1 \vee x_2 \vee \bar x_3) \wedge (\bar x_1 \vee \bar x_2 \vee x_3) \wedge (x_1 \vee x_2 \vee x_3)$.
2. Translate this Boolean expression into a switching net by interpreting conjunction as series coupling and disjunction as parallel coupling.
3. Construct a perceptron which “looks at” the position of the switches.

The reduction argument in intuitive form is as follows: the Huffman switching net can be regarded as defining a class $F$ of geometric figures which are connected or not depending on the parity of a certain set, the set of switches that are in the “on” position. We thus see how a perceptron for $\psi_{\mathrm{CONNECTED}}$ on one set $R$ can be used as a perceptron for $\psi_{\mathrm{PARITY}}$ on a second set $\tilde R$. As a perceptron for $\psi_{\mathrm{PARITY}}$ it must be of order at least $|\tilde R|$. Thus the order of $\psi_{\mathrm{CONNECTED}}$ must be at least $|\tilde R|$.

We can use the collapsing theorem §5.4.1 to formalize this argument. But before doing so note that a certain price will be paid for its intuitive simplicity: the set $R$ is much bigger than the set $\tilde R$; in fact $|R|$ must be of the order of magnitude of $2^{|\tilde R|}$, so that the best result to be obtained from the construction is that the order of $\psi_{\mathrm{CONNECTED}}$ must increase as $\log |R|$. This gives a weaker lower bound than was found in §5.1: $\log |R|$ compared with $|R|^{1/3}$.
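Steps 1 and 2 of the construction can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' construction: the names `parity_cnf` and `net_conducts` are hypothetical, and the net is modeled abstractly as a series chain of parallel groups of contacts rather than as a figure on the retina. The sketch checks that the net conducts end to end exactly when an odd number of switches are on, and counts the labeled contacts, whose number $n \cdot 2^{n-1}$ is the growth behind the $\log |R|$ bound.

```python
from itertools import product


def parity_cnf(n):
    """Clauses of the conjunctive normal form of 'an odd number of x_1..x_n are on'.

    Each clause is a tuple of literals (index, value): a contact that conducts
    when switch `index` has that value (1 for x_i, 0 for the barred symbol).
    One clause is built per even-weight assignment and fails only there.
    """
    clauses = []
    for a in product((0, 1), repeat=n):
        if sum(a) % 2 == 0:
            clauses.append(tuple((i, 1 - a[i]) for i in range(n)))
    return clauses


def net_conducts(clauses, switches):
    """Series coupling of groups (AND); each group is a set of parallel contacts (OR)."""
    return all(any(switches[i] == v for i, v in clause) for clause in clauses)


n = 3
clauses = parity_cnf(n)
contacts = sum(len(c) for c in clauses)
print(len(clauses), contacts)

for switches in product((0, 1), repeat=n):
    print(switches, net_conducts(clauses, switches))
```

For $n = 3$ the count of labeled contacts comes out to 12, which agrees with the 12 points labeled $x_i$ or $\bar x_i$ in the example of §5.4.1.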
To apply the collapsing theorem we simply define the space $\tilde R$ to be the three-point space described at the end of §5.4. Then $\psi_{\mathrm{PARITY}}$ on $\tilde R$ is equal to $\psi_{\mathrm{CONNECTED}}$ on $R$ for those figures obtained by applying $F$ to figures on $\tilde R$. The collapsing theorem states that the order of $\psi_{\mathrm{PARITY}}$ is $\le$ the order of $\psi_{\mathrm{CONNECTED}}$.

5.6 Connectivity on a Toroidal Space $|R|$

Our earliest attempts to prove that connectedness has unbounded order led to the following curious result:

Theorem 5.6: The predicate $\psi_{\mathrm{CONNECTED}}$ on a $2n \times 6$ toroidally connected space has order $\ge n$.

The proof is by construction: consider the space

[inline figure]

in which the edges $e$, $e'$ and $f$, $f'$ are considered to be identical (see also Figure 2.5). Now consider the family $F$ of subsets of $R$ that satisfy the conditions

1. All the shaded points belong to each $X \in F$,
2. For each $X \in F$ and each $i$, either both points marked $a_i$ or both points marked $b_i$ are in $X$, but no other combinations are allowed.

Then it can be seen, for each $X \in F$, that $X$ has either one connected component or $X$ divides into two separate connected figures. Which case actually occurs depends only on the parity of the number of $a_i$'s in $X$. Then using the collapsing theorem and Theorem §3.1.1, we find that $\psi_{\mathrm{CONNECTED}}$ has order $\ge n$.

The idea for Theorem 5.6 came from the attempt to reduce connectivity to parity directly by representing the switching diagram shown in Figure 5.4. If an even number of switches are in the “down” position then $x$ is connected to $x'$ and $y$ to $y'$. If the number of down switches is odd, $x$ is connected to $y'$ and $x'$ to $y$. This diagram can be drawn in the plane by bringing the vertical connections around the end (see Figure 5.11); then one finds that the predicate $\lceil x \text{ is connected to } x' \rceil$ has for order some constant multiple of $|R|^{1/2}$.
If we put the toroidal topology on $R$, the order is known (§5.6) to be greater than a constant fraction of $|R|$; this is also true for a 3-dimensional Euclidean $R$. These facts strongly suggest that our bound for the order of $\psi_{\mathrm{CONNECTED}}$ is too low for the plane case.

5.7 A Better Bound for $\psi_{\mathrm{CONNECTED}}$ in the Plane

The following construction shows that the order of $\psi_{\mathrm{CONNECTED}}$ is $\ge \mathrm{const} \cdot |R|^{1/2}$ for two-dimensional figures. It results from modifying Figure 5.4 so as to connect $x$ to $x'$. This is easy for the torus, but for a long time we thought it was impossible in the plane. We first define a “4-switch” to be the pair of figures

[Figure 5.5]

In the down state, one can see that $p_i$ is connected to $q_{(i+1)_4}$, where $(j)_4$ is the remainder when $j$ is divided by 4. In the up state, we have $p_i$ connected to $q_{(i-1)_4}$.

Now consider the effect of cascading $n$ such switches, as shown in Figure 5.6. This simply iterates the effect: in fact, if $d$ switches are down and $u$ switches are up, we have $p_i$ connected to $q_{(i+d-u)_4}$ for all $i$. Now, since every switch is either up or down, $d + u = n$, hence

$q_{(i+d-u)_4} = q_{(i+2d-n)_4}$,

and we notice that this depends only upon the parity of $d$, because $2d$ is congruent to 0 or 2 modulo 4.

[Figure 5.6]

Next, we add fixed connections that tie together the terminals $q_{(1-n)_4}$, $q_{(2-n)_4}$, and $q_{(3-n)_4}$. Then if $d$ is even, $p_1$, $p_2$, $p_3$ are tied together, while if $d$ is odd, $p_3$, $p_0$, $p_1$ are tied together. In each case $p_1$ and $p_3$ are connected, so we can ignore, say, $p_3$. So the connectivity of the system has just two states, depending on the parity of the number of switches in the down position, and these states can be represented as shown below.

[Figure 5.7]

To prove our theorem we simply tie $p_1$ and $p_2$ together.

[Figure 5.8]

It remains only to realize the details of the “4-switches.” Figure 5.9 illustrates the two configurations.
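The arithmetic of the cascade in §5.7 can be checked with a small simulation. The sketch below is an illustration under assumptions, not the geometric construction itself: each 4-switch is modeled only as a shift of the terminal index by +1 (down) or -1 (up) modulo 4, and the names `cascade_target` and `components` are hypothetical. It follows each $p_i$ through the cascade, applies the fixed ties among $q_{(1-n)_4}$, $q_{(2-n)_4}$, $q_{(3-n)_4}$, ties $p_1$ to $p_2$, and counts the connected pieces.

```python
from itertools import product


def cascade_target(i, states):
    """Return j such that p_i is connected to q_j through the cascade.

    `states` is a sequence of 'd' (down, shift +1 mod 4) or 'u' (up, shift -1 mod 4).
    """
    for s in states:
        i = (i + 1) % 4 if s == "d" else (i - 1) % 4
    return i


def components(states):
    """Number of connected pieces among p_0..p_3 and q_0..q_3 after all ties."""
    n = len(states)
    parent = list(range(8))  # nodes 0..3 are p_0..p_3, nodes 4..7 are q_0..q_3

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    for i in range(4):                       # wiring through the switches
        union(i, 4 + cascade_target(i, states))
    tied = [4 + (k - n) % 4 for k in (1, 2, 3)]
    union(tied[0], tied[1])                  # fixed ties among three q terminals
    union(tied[1], tied[2])
    union(1, 2)                              # final tie between p_1 and p_2
    return len({find(a) for a in range(8)})


for states in product("du", repeat=3):
    d = states.count("d")
    print("".join(states), "d =", d, "pieces =", components(states))
```

Under this model the network is in one piece exactly when the number of down switches is odd, and in two pieces otherwise, which is the parity dependence the argument needs.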
[Figure 5.9]

Remember that corner contact [inline figure] is not a connection. When the entire construction is completed for $n$ switches, the network will be about $5n$ squares long and about $2n + 12$ squares high, so that the number of switches can grow proportionally to $|R|^{1/2}$. It follows that the order of $\psi_{\mathrm{CONNECTED}}$ grows at least as fast as $|R|^{1/2}$. Figure 5.10 illustrates the complete construction.

One must verify that there remain no “stray” connection lines that are not attached eventually to $p_0$, $p_1$, or $p_2$. This can be verified by inspection of Figure 5.6. Furthermore, no closed loops are formed, other than the one indicated in the left-hand part of Figure 5.8.

[Figure 5.11]

The idea for Theorem 5.7 comes from observing that in the planar version of Figure 5.4 (see Figure 5.11) we have $p_1 \leftrightarrow q_1$ and $p_2 \leftrightarrow q_2$ for one parity and $p_1 \leftrightarrow q_2$ and $p_2 \leftrightarrow q_1$ for the other. If we could make a permanent additional direct connection from $p_1$ to $q_1$ then the whole net would be connected or disconnected according to the parity. But this is topologically impossible, and because the construction appeared incompletable we took the long route through proving and applying the one-in-a-box theorem. Only later did we realize that the $p_1 \leftrightarrow q_1$ connection could be made “dynamically,” if not directly, by the construction in Figure 5.10.

5.7.2 The Order of $\psi_{\mathrm{CONNECTED}}$ as a Function of $|R|$

What is the true order? Let us recall that at the root of the proof methods we used was the device (§5.0) of considering not all the figures but only special subclasses with special combinatorial features. Thus even the order of §5.6 is only a lower bound. Our suspicion is that the order cannot be less than a fixed fraction of $|R|$.

As for the number of $\varphi$'s required, Theorem 3.1.2 and the toroidal results give us $\ge 2^{|R|/12}$, but this, too, is only a lower bound, and one suspects that nearly all the masks are needed.
Another line of thought suggests that one might get by with an order like the logarithm of the number of connected figures, but that has probably not much smaller an exponent.

Examination of the toroidal construction in §5.6 might make one suspect that the result, that the order of $\psi_{\mathrm{CONNECTED}}$ is $\ge \frac{1}{12}|R|$, is an artifact resulting from the use of a long, thin torus. Indeed, for a “square” torus we could not get this result because of the area that would be covered by the connecting bridge lines. This clouds the conclusion a little. On the other hand, if we consider a three-dimensional $R$, then there is absolutely no difficulty in showing that the order of $\psi_{\mathrm{CONNECTED}}$ is $\ge (1/K)|R|$, for some moderate value of $K$. It is hard to believe that the difference in dimension could matter very much.

5.8 Topological Predicates

We have seen that $\lceil X \text{ is connected} \rceil$ is not of finite order and we shall see soon that $\lceil X \text{ contains a hole} \rceil$ is also not of finite order. [The sentence that follows here carries a handwritten alteration by the authors and is not legible in this copy.] This will be shown by a construction involving the Euler relation for orientable geometric figures.

5.8.1 The Euler Polygon Formula

Two-dimensional figures have a topological invariant* that in polygonal cases is given by

$B(X) = |\mathrm{faces}(X)| - |\mathrm{edges}(X)| + |\mathrm{vertices}(X)|$.

The sums under the examples in Figure 5.12 illustrate this formula by counting the number of faces, edges, and vertices, respectively. Use of the formula presupposes that a figure is cut into sufficiently small pieces that each “face” be simple, that is, contain no holes. It is a remarkable fact that $B(X)$ will have the same value for any dissection of $X$ that meets this condition.
[Figure 5.12: example figures, each with its sum of faces, edges, and vertices]

In our context of figures made up of checkerboard squares, $B(X)$ can be computed by a low-order linear sum $G(X)$ defined as follows:

$G(X) = \sum \alpha_i x_i + \sum \alpha_{ij} x_i x_j + \sum \alpha_{ijkl} x_i x_j x_k x_l$,

where

$\alpha_i = 1$ for each point $x_i$ of $R$ (vertices),
$\alpha_{ij} = -1$ for each adjacent pair (edges),
$\alpha_{ijkl} = 1$ for each square (faces).

*For our purposes here, a “topological invariant” is nothing more than a predicate that is unchanged when the figure is distorted without changing connectedness or inside-outside relations of its parts.

$G(X)$ and $B(X)$ exactly agree on checkerboard figures without corner contacts like [inline figure]. When they disagree in such cases, our definition of connectedness requires the value of $G(X)$.

The importance of $G(X)$ in our theory lies in the fact that although it is highly local (in fact, diameter limited and finite order) it is equivalent to the global formula

$E(X) = |\mathrm{components}(X)| - |\mathrm{holes}(X)|$.

A component of a figure is the set of all points connected to a given point. A hole of a figure is a component of the complement of a figure. We assume that a figure is surrounded by an “outside” that does not count as a hole. Also, we have to define “corner contact” to be a connection when dealing with a figure’s complement.

Now we will prove that the local formula $G(X)$ and the global formula $E(X)$ are equivalent. First we will give a rather direct demonstration. Then in §5.8.2 we will give another kind of proof, based on deforming one figure into another, that will give a better insight into the proof of the main theorem of §5.9.

Any figure $X$ can be obtained by beginning with a one-square figure and adding squares one at a time. For a single square we have $G(X) = E(X) = 1$.
Adding a square that is not adjacent to any square already in $X$ adds 1 to $G(X)$, and (since it is a new component!) adds 1 to $E(X)$. Adding a square adjacent to exactly one other square cannot change $E(X)$, and adds exactly $0 - 1 + 1 = 0$ to $G(X)$.

Three kinds of things can happen when one adds a square adjacent to two others. When the new square fills in a corner, as in [inline figure], then $1 - 2 + 1 = 0$ is added to $G$, leaving it unchanged, and neither is $E(X)$ changed in this case. But when the new square connects two others that were not already connected together, as in [inline figure], then there is a net decrease of $1 - 2 + 0 = -1$ in $G$ together with a decrease in $E(X)$, because we have joined two previously separated parts. Finally, if the added square connects two squares that are already connected by some remote path, as in [inline figure], then a region of space is cut off: a hole is formed, decreasing $E$ by 1, and again the change in $G$ is $1 - 2 + 0 = -1$.

Case analyses of the 3-neighbor and 4-neighbor situations complete the proof: these include partial fills like [inline figures], which add $1 - 3 + 2 = 0$ and $1 - 4 + 4 = 1$. Notice that in the latter case, $G$ is increased by one unit, as the hole is finally filled in. In each case either $G$ was unchanged, or the topology of the figure $X$ was changed. (All this corresponds to an argument in algebraic topology concerning addition of edges and cells to chain-complexes.) This proves

Theorem 5.8.1: $E(X) = G(X)$.

It follows immediately that the predicate $\lceil G(X) < n \rceil$ is realized with order $\le 4$. This leads to some curious observations: if we are given that the figures $X$ are restricted to the connected (= one-component) figures, then an order-4 machine can recognize

$\lceil X \text{ has no holes} \rceil = \lceil G(X) > 0 \rceil$

or

$\lceil X \text{ has less than 3 holes} \rceil = \lceil G(X) > -2 \rceil$.

[Handwritten addition by the authors, partly legible: given that there are no holes, we can recognize $\psi_{\mathrm{CONNECTED}}$.]

But of course we cannot conclude that these can be recognized unconditionally by a finite-order perceptron.
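Theorem 5.8.1 can be checked by direct computation on small figures. The following Python sketch is an illustration under stated assumptions: a figure is represented as a set of `(row, column)` squares, the names `G`, `E`, and `count_components` are chosen here for readability, and “for each square” in the definition of $G$ is read as “for each 2 x 2 block of four points.” Following the text, components of the figure use edge contact only, while holes (components of the complement) also allow corner contact.

```python
EDGE = [(0, 1), (1, 0), (0, -1), (-1, 0)]
EDGE_AND_CORNER = EDGE + [(1, 1), (1, -1), (-1, 1), (-1, -1)]


def G(cells):
    """Local sum: +1 per point, -1 per edge-adjacent pair, +1 per full 2 x 2 block."""
    cells = set(cells)
    pairs = sum(((r, c + 1) in cells) + ((r + 1, c) in cells) for r, c in cells)
    blocks = sum(
        (r, c + 1) in cells and (r + 1, c) in cells and (r + 1, c + 1) in cells
        for r, c in cells
    )
    return len(cells) - pairs + blocks


def count_components(cells, neighbours):
    """Count connected pieces of `cells` under the given neighbour offsets."""
    cells = set(cells)
    seen, count = set(), 0
    for start in cells:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            r, c = stack.pop()
            for dr, dc in neighbours:
                nxt = (r + dr, c + dc)
                if nxt in cells and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return count


def E(cells):
    """Global count: components minus holes."""
    cells = set(cells)
    if not cells:
        return 0
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    # Complement inside the bounding box padded by one square on every side;
    # the padding ring is the single "outside" piece, which is not a hole.
    background = {
        (r, c)
        for r in range(min(rows) - 1, max(rows) + 2)
        for c in range(min(cols) - 1, max(cols) + 2)
    } - cells
    holes = count_components(background, EDGE_AND_CORNER) - 1
    return count_components(cells, EDGE) - holes


ring = {(r, c) for r in range(3) for c in range(3)} - {(1, 1)}
diagonal_pair = {(0, 0), (1, 1)}
print(G(ring), E(ring))                     # 0 0
print(G(diagonal_pair), E(diagonal_pair))   # 2 2
```

The ring has one component and one hole, so both sums are 0; the diagonal pair counts as two components because corner contact is not a connection for the figure itself.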
Note that this topological invariant is thus seen to be highly “local” in nature; indeed, all the $\varphi$'s satisfy a very tight diameter-limitation!

Now returning to our initial claim we note that

$\lceil G(X) = n \rceil \equiv (\lceil G(X) \le n \rceil \equiv \lceil G(X) \ge n \rceil)$.

By Theorem 1.5.4 we can conclude that $\lceil G(X) = n \rceil$ has order $\le 8$. But the proof of that theorem involves constructing product-$\varphi$'s that are not diameter-limited, and we show in §8.4 that this predicate cannot be realized by diameter-limited perceptrons.

5.8.2 Deformation of Figures into Standard Forms

The proof of Theorem §5.8.1 shows that $G(X)$ will take the same values on any two figures $X$ and $Y$ that have the same value of $E = |\mathrm{components}| - |\mathrm{holes}|$. Now we will show that one can make a sequence of figures $X, \ldots, X_i, \ldots, Y$, each differing from its predecessor in one locality, and all having the same values for $G = E$. It is easy to see how to deform figures “smoothly” without changing $G$ or $E$, in fact, without changing holes or components. For example, the sequence

[inline figure]

can be used to enlarge a hole. Now we observe that if a component $C_0$ lies within a hole $H_1$ of another component $C_1$, then $C_0$ can be moved to the outside without changing $E(X)$ or $G(X)$. Suppose, for simplicity, that $C_1$ touches the “outside” and that $C_0$ is “simply” in $H_1$; that is, there is no component $C$ also enclosing $C_0$.

[inline figure]

Then $C_0$ can be removed from $H_1$ by a series of deformations in which, first, $H_1$ is drawn to the periphery and then $C_0$ is temporarily attached:

[inline figure]

Notice that this does not change the value of $G(X)$. Also, since it reduces both $C$ and $H$ by unity, it does not change $E(X) = C(X) - H(X)$. We can then swing $C_1$ around to the outside

[inline figure]

and reconnect to obtain

[inline figure]

again without changing $G(X)$ or $E(X)$. Clearly, we can eventually clear out all holes, by repeating this on each component as it comes to the outside.
When this is done, we will have some number of components, each of which may have some empty holes, and they can be all deformed into standard types of figures like

[inline figure]

Now by reversing the previous operation that took us from step 6 to 7, we can fuse any component that has a hole with any other component, for example,

[inline figure]

and thus one can reduce simultaneously both $C$ and $H$ until $H$ is zero or $C$ is unity. At this point one has either

[inline figure: $n$ components] or [inline figure: $m$ holes].

In each case one can verify that $G(X) = E(X) = n$ or $G(X) = E(X) = 1 - m$. We will apply this well-known result in the next section.

5.9 Topological Limitations of Perceptrons

Theorem 5.9: The only topologically invariant predicates of finite order are functions of the Euler number $E(X)$.

The authors had proved the corresponding theorem for the diameter-limited perceptron, and conjectured that it held also for the order-limited case but were unable to prove it. It was finally established by Michael Paterson, and §5.9.1 is based upon his idea.

5.9.1 Filling Holes

Suppose that $C(X) \ge 2$ and $H(X) \ge 1$. Choose a hole $H_0$ in a component $C_0$. Let $C_1$ be a component “accessible” to $C_0$, that is, there is a path $P_{01}$ from a boundary point of $C_0$ to a boundary point of $C_1$ that does not touch $X$. Let $P_{00}$ be a path within $C_0$ from a point on the boundary of hole $H_0$ to a point on another boundary of $C_0$, such that $P_{00}$ and $P_{01}$ are connected.

[inline figure]

This is always possible even if $C_1$ is within $H_0$, outside $C_0$ completely, or within some other hole in $C_0$.

Now, if $\psi(X)$ is a topologically invariant predicate, its value will not be changed when we deform the configuration in the following way:

[inline figure]

Finally suppose that we were permitted to change the connections in the box, to

[inline figure]

In effect, this would cut along $P_{00}$, removing a hole, and connect across one side of $P_{01}$, reducing by unity the number of components.
Thus it would leave $E(X)$ unchanged.

Now we will show that making this change cannot affect the value of $\psi$! Suppose that $\psi$ has order $k$. Deform the figure $X$ until the box contains a cascade of $k + 1$ “4-switches.” (See Figure 5.6.) This does not change the topology, so it leaves $\psi$ unchanged.

Now consider the $2^{k+1}$ variants of $X$ obtained by the $2^{k+1}$ states of the cascade switch. If $\psi$ has the same value for all of these, then we obviously can make the change, trivially, without affecting $\psi$. If two of these give different values, say $\psi(X') \ne \psi(X'')$, then these must correspond to different parities of the switches, because $\psi$ is a topological invariant. But if this is so, then $\psi$ must be able to tell the parity of the switch, since all $X$'s of a given parity class are topological equivalents (see details in §5.7). But because of the collapsing theorem, we know that this cannot be: $\psi$ must become “confused” on a parity problem of order $> k$. Therefore all figures obtained by changing the switches give the same value for $\psi$, so we can apply the transformations described in §5.8.2 without changing the values of $\psi$.

5.9.2 The Canonical Form

We can use the method of §5.9.1 and §5.8.2 to convert an arbitrary figure $X$ to a canonical form that depends only upon $E(X)$ as follows: we simply repeat it until we run out of holes or components. At this point there must remain either

1. A single component with one or more holes, or
2. One or more simple solid components

according to whether or not $E(X) \le 0$. In case 1 the final figure is topologically equivalent to a figure like

[inline figure: a single component with $m$ holes]

with $1 - E(X)$ holes, while in case 2 the final figure is equivalent to one like

[inline figure: separate solid components]

with $E(X)$ solid squares. Clearly, then, any two figures $X$ and $X'$ that have $E(X) = E(X')$ must also have $\psi(X) = \psi(X')$.
This completes the proof of Theorem 5.9, which says that $\psi(X)$ must depend only upon $E(X)$.

Remark: There is one exception to Theorem 5.9 because the canonical form does not include the case of the all-blank picture! For the predicate $\lceil X \text{ is nonempty} \rceil$ is a topological invariant but is not a function of $E(X)$! See §8.1.1 and §8.4.

There are many other topological invariants besides the number of components of $X$, and $G(X)$, for example, $\lceil$a component of $X$ lies within a hole within another component of $X\rceil$. The theorem thus includes the corollary that no finite-order predicate can distinguish between figures that contain others (left below) and those (right below) that don’t.

[inline figures]

Problem: What are the extensions of this analysis to topological predicates in higher dimensions? Is there an interpretation of $\sum \alpha_i \varphi_i$ as a cochain on a simplicial complex, in which the threshold operation has some respectably useful meaning?

[Handwritten note by the authors; not legible in this copy.]

6 Geometric Patterns of Small Order: Spectra and Context

6.0 Introduction to Chapters 6 and 7

In Chapters 6 and 7 we study predicates that are more strictly geometric than is connectedness. An example typical of the problems discussed is to recognize all translations of a particular figure or class of figures. In one sense the results are more positive than those of the previous chapter. Many such problems can be solved using low-order perceptrons, and the two chapters will be organized around two techniques for constructing geometrical predicates whose orders are often surprisingly small.

The technical content of this introduction will be partly incomprehensible until after Chapter 7 is read.
It is intended, if read in the proper spirit, to provide an atmosphere enveloping this series of results and observations with a certain coherence.

Whenever we can apply the group invariance theorem, the study of invariant predicates of small order reduces to the study of a few kinds of elementary local predicates. The bigger the group, the smaller and simpler becomes this set of elementary predicates. Because $\psi_{\mathrm{PARITY}}$ is invariant under the biggest possible group (namely, all permutations) we were able to use for the elementary predicates the simple masks, classified merely according to the sizes of their support sets. Interesting geometric predicates will not survive such drastic transformations. Groups such as translations or general rigid motions lead to more numerous equivalence-types of partial predicates. Figures satisfying invariant predicates will nevertheless be characterized entirely by the sets of numbers which tell us how many of each type of partial predicate they satisfy. We shall call these sets spectra and show in Chapter 6 how to use them.

Chapter 7 will center around a very different technique for constructing geometric predicates. Whenever the group can be ordered in an appropriate way we can stratify the set of figures equivalent, under the group, to a given one, by the rank order of the group element necessary to effect the transformation. We can thus (in many interesting cases) split the recognition problem into two parts: recognize the stratum to which the figure belongs and then apply a simple test appropriate to the stratum. This description has an air of serial rather than parallel computation and, indeed, part of its interest is that it shows at least one way of simulating a serial, or conditional, operation using a parallel procedure.

Naturally a price has to be paid for this simulation.
Our method of achieving it leads to extremely large coefficients in the linear representations obtained. Taken in itself this does not exclude the possibility of some other procedure achieving the same result more cheaply. We are therefore led (in Chapter 10) to a new area of study, the bounds on coefficient sizes, and to some intriguing, though as yet only partially understood, results.

We recall that our proof of the group-invariance theorem assumed that the group was finite. The ordering we use in stratification assumes that the group is infinite: for example, the translations on the infinite plane are ordered in the obvious way, but this becomes impossible if the group is made finitely cyclic by the toroidal construction described in §5.6.

When we first ran into this conflict, the techniques of stratification and those related to group invariance (spectra, etc.) seemed to be strictly disjoint areas of research. But further study brought them together in a possibly deep way. We can in fact rescue the group-invariance theorem in some infinite cases by assuming that the coefficients are bounded. For example, suppose that $\psi(X)$ is a predicate defined for finite figures, $X$, on the infinite plane and is invariant under the group of translations. Then it can be expressed as an infinite linear form, for example,

$\psi(X) = \lceil \sum_{\varphi \in \Phi} \alpha_\varphi \varphi(X) > \theta \rceil$

where $\Phi$ is an infinite set (for example, the masks) chosen so that for any finite $X$ all but a finite number of terms in the sum will vanish. Now, if we know that the $\alpha_\varphi$ are bounded we can use (by Theorem 10.4.1) the group-invariance theorem. In some particular cases this yields an order greater than that obtained by stratification. The contradiction can be dissipated only by concluding that the coefficients $\alpha_\varphi$ cannot be bounded for any low-order representative.
It follows that the largeness of the coefficients produced by our stratification procedure is not merely an accidental result of an inept algorithm (though, of course, the actual values might be; we have not shown that they are minimal).

We were, of course, delighted to find that what seemed at first to be a limitation of our favorite theorem could be used to yield a valuable result. But our feeling that the situation is deeply interesting goes beyond this immediate (and practical) problem of sizes of coefficients. It really comes from the intrusion of the global structure of the transformation group. For a long time we believed that the recognition of all translations of a given figure was a high-order problem. Stratification showed we were wrong. But we have not been able to find low-order predicates for the corresponding problem when the group contains large finite cyclic subgroups such as rotations or translations on the torus, and we continue to entertain the conjecture that these are not finite-order problems.

Complementing the positive results of Chapter 6 will be found one negative theorem of considerable practical interest. This concerns the recognition of figures in context. It is easy to decide, using a low-order predicate, whether a given figure is, say, a rectangle. The new kind of problem is to decide whether the figure contains a rectangle and, perhaps, something else as well (see Figure 6.1). It seems obvious that recognition-in-context should be somewhat harder, perhaps requiring a somewhat higher order. We shall show (§6.6) that it is worse than that: it is not even of finite order!

Finally it should be noted that we manage, once more, to avoid the need to use a tolerance theory to escape from the limitations of our square-grid arrays. The translation group does not raise this problem. The rotation group does; but we say all we have to say in the context of 90° rotations.
The similarity group suggests the most serious difficulties: one can dilate a figure easily enough, but how can one contract a small one? As it happens we have nothing interesting to say about this group. We urge future workers to be less cowardly.

In §6.1 through §6.4 we begin by showing that certain patterns have orders $= 1$, $= 2$, $\le 3$, $\le 4$, respectively. In most cases we have not established the lower bound on the orders and have no systematic methods for doing so.

6.1 Geometric Patterns of Order 1

When we say “geometric property” we mean something invariant under translation, usually also invariant under rotation, and often invariant under dilation. The first two invariances combine to define the “congruence” group of transformations, and all three treat alike the figures that are “similar” in Euclidean geometry.

For order 1 we know that coefficients can be assumed to be equal.* Therefore, the only patterns that can be of order 1 are those defined by a single cut in the cardinality or area of the set:

$\psi = \lceil |X| > A \rceil$ or $\psi = \lceil |X| < A \rceil$.

Note: If translation invariance is not required, then perceptrons of order 1 can, of course, compute other properties, for example, concerning moments about particular points or axes. (See §2.4.1.) However, these are not “geometric” in the sense of being suitably invariant. So while they may be of considerable practical importance, we will not discuss them further.†

6.2 Patterns of Order 2, Distance Spectra

For $k = 2$ things are more complicated. As shown in §1.4, Example 3, it is possible to define a double cut or segment $\lceil A_1 < A < A_2 \rceil$ in the area of the set and recognize the figures whose areas satisfy

$\psi = \lceil A_1 < |X| < A_2 \rceil$.

In fact, in general we can always find a function of order $2k$ that recognizes the sets whose areas lie in any of $k$ intervals.
* All the theorems of this chapter assume that the group invariance theorem can be applied, even though the translation group is not finite. This can be shown to be true if (Theorem 10.4) the coefficients are bounded; it can be shown in any case for order 1, and there are all sorts of other conditions that can be sufficient. In §7.10 we see that the group-invariance theorem is not always available. We do not have a good general test for its applicability. Of course, the coefficients will be bounded in any physical machine!

† See, for example, Pitts and McCulloch [1947] for an eye-centering servomechanism using an essentially order-1 predicate.

But let us return to patterns with geometric significance. First, consider only the group of translations, and masks of order 2. Then two masks $x_1 x_2$ and $x_1' x_2'$ are equivalent if and only if the difference vectors $x_1 - x_2$ and $x_1' - x_2'$ are equal, with same or opposite sign. Thus, with respect to the translation group, any order-2 predicate can depend only on a figure's “difference-vector spectrum,” defined as the sequence of the numbers of pairs of points separated by each difference vector. For example, two figures [not reproduced here] have the same difference-vector spectra, that is, the same number of pairs for each vector (the original table lists the counts 4, 1, 2, 1, 1, 1 against pictured vectors). Hence no order-2 predicate can make a classification which is both translation invariant and separates these two figures. In fact, an immediate consequence of the group-invariance theorem is:

Theorem 6.2: Let $\psi(X)$ be a translation-invariant predicate of order 2. Define $n_v(X)$ to be the number of pairs of points in $X$ that are separated by the vector $v$. Then $\psi(X)$ can be written

$\psi(X) = \lceil \sum_v \alpha_v n_v(X) > \theta \rceil$.

Proof: Exactly $n_v(X)$ predicates in the equivalence class of $v$ are satisfied by any translation of $X$. By Theorem 2.3 they can all be assigned the same coefficient.

Corollary: Two figures with the same translation spectrum $n(v)$ cannot be distinguished by a translation-invariant order-2 perceptron. (But see footnote to §6.1.)
Conversely, if the spectra are different, for example $n_{v_1}(A) < n_{v_1}(B)$, then the translations of two figures can be separated by $\lceil n_{v_1}(X) < n_{v_1}(B) \rceil$. But classes made of different figures may not be so separable.

Example: the figures [first pair, not reproduced] are indistinguishable by order-2 predicates, while [second pair, not reproduced] have different difference-vector spectra and can be separated.

If we add the requirement of invariance under rotations, the latter pair above becomes indistinguishable, because the equivalence-classes now group together all differences of the same length, whatever their orientation.

Note that we did not allow reflections, yet these reflectionally opposite figures are now confused! One should be cautious about using “intuition” here. The theory of general rotational invariance requires careful attention to the effect of the discrete retinal approximation, but could presumably be made consistent by a suitable tolerance theory; for the dilation “group,” there are serious difficulties. (For the group generated by the 90° rotations, the example above fails but the following example works.)

An interesting pair of figures rotationally distinct, but nevertheless indistinguishable for $k = 2$, is the pair [figures not reproduced], which have the same (direction-independent) distance-between-point-pair spectra through order 2, namely,

$|x_i - x_j| = 1$ from 4 pairs
$|x_i - x_j| = \sqrt{2}$ from 2 pairs
$|x_i - x_j| = 2$ from 2 pairs
$|x_i - x_j| = \sqrt{5}$ from 2 pairs

and each has 5 points (the order-1 spectrum).

The group-invariance theorem, §2.3, tells us that any group-invariant perceptron must depend only on a pattern's “occupancy numbers,” that is, exactly the “geometric spectra” discussed here. Many other proposals for “pattern-recognition machines” (not perceptrons, and accordingly not representable simply as linear forms) might also be better understood after exploration of their relation to the theory of these geometric spectra.
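The spectra of §6.2 are easy to compute directly, and doing so makes the effect of enlarging the group visible. The following is a minimal sketch in Python, not part of the original text: the function names and the two four-point test figures (an L shape and its mirror image) are illustrative assumptions, not the figures printed in the book. Squared distances are used in place of distances so that all arithmetic stays in integers.

```python
from collections import Counter
from itertools import combinations


def canonical(v):
    """Identify v with -v, since the pair (a, b) equals the pair (b, a)."""
    dx, dy = v
    return (dx, dy) if (dx, dy) > (0, 0) else (-dx, -dy)


def difference_vector_spectrum(figure):
    """n_v(X): number of point pairs of X separated by v (up to sign)."""
    return Counter(
        canonical((bx - ax, by - ay))
        for (ax, ay), (bx, by) in combinations(sorted(figure), 2)
    )


def distance_spectrum(figure):
    """Direction-independent spectrum, keyed by squared distance."""
    return Counter(
        (bx - ax) ** 2 + (by - ay) ** 2
        for (ax, ay), (bx, by) in combinations(sorted(figure), 2)
    )


def translate(figure, shift):
    """Translate every point of the figure by the vector shift."""
    sx, sy = shift
    return {(x + sx, y + sy) for (x, y) in figure}


# Hypothetical test figures: an L shape and its mirror image.
L_SHAPE = {(0, 0), (1, 0), (2, 0), (0, 1)}
MIRROR = {(0, 0), (1, 0), (2, 0), (2, 1)}
```

With these definitions a translate of `L_SHAPE` has the same difference-vector spectrum as `L_SHAPE` itself, while `L_SHAPE` and `MIRROR` differ in their difference-vector spectra but agree in their distance spectra. This is the confusion of reflectionally opposite figures that the text warns about once orientation is discarded.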
But it seems unlikely that this kind of analysis would bring a great deal to the study of the more “descriptive” or, as they are sometimes called, “syntactic” scene-analysis systems that the authors secretly advocate.

Another example of an order-2 predicate is

$\lceil X \text{ lies within a row or column and has} < n \text{ segments} \rceil$

which can be defined by a form of the shape $\lceil \sum(\cdot) + \sum(\cdot) - \sum(\cdot) + (\text{all non-collinear pairs}) < n \rceil$, where the three summands are sums over pictured masks that are not reproduced in this copy.

6.3 Patterns of Order 3

6.3.1 Convexity

A particularly interesting predicate is

$\psi_{\text{CONVEX}}(X) = \lceil X \text{ is a single, solid convex figure} \rceil$.

That this is of order $\le 3$ can be seen from the definition of “convex”: $X$ is convex if and only if every line-segment whose end points are in $X$ lies entirely within $X$. It follows that $X$ is convex if and only if

$a \in X \text{ and } b \in X \implies \text{midpoint}([a,b]) \in X$;

hence

$\psi_{\text{CONVEX}}(X) = \lceil \sum_{\text{all } a, b \text{ in } X} [\text{midpoint}([a,b]) \text{ not in } X] < 1 \rceil$

has order $\le 3$ and presumably order $= 3$. This is a “conjunctively local” condition of the kind discussed in §0.2.

Note that if a connected figure is not convex one can further conclude that it has at least one “local” concavity, as in [figure not reproduced], with the three points arbitrarily close together. Thus, if we are given that $X$ is connected, then convexity can be realized as diameter-limited and order 3. If we are not sure $X$ is connected, then the preceding argument fails in the diameter-limited case because a pair of convex figures, widely separated, will be accepted.

Indeed, convexity is probably not order 3 under the additional restriction of diameter limitation, but one should not jump to the conclusion that it is not diameter-limited of any order, because of the following “practical” consideration:

Even given that a figure is connected, its “convexity” can be defined only relative to a precision of tolerance. In addition figures must be uniformly bounded in size, or else the small local tolerance becomes globally disastrous.
But within this constraint, one can approximate an estimate of curvature, and define “convex” to be $\int(\text{curvature}) < 4\pi$. We will discuss this further in §8.3 and §9.9.

6.3.2 Rectangles

Figure 6.2 Some “hollow” rectangles.

Within our square-array formulation, we can define with order 3 the set of solid axis-parallel rectangles. This can even be done with diameter-limited $\varphi$'s, by a form [pictorial formula not reproduced] where all $\varphi$'s equivalent under 90° rotation are included. The hollow rectangles are caught by a form [pictorial formula not reproduced] where the coefficients are chosen to exclude the case of two or more separate points. These examples are admittedly weakened by their dependence on the chosen square lattice, but they have an underlying validity in that the figures in question are definable in terms of being rectilinear with not more than four corners, and we will discuss this slightly more than “conjunctively local” kind of definition in Chapter 8.

One would suppose that the sets of hollow and solid squares would have to be of order 4 or higher, because the comparison of side-lengths should require at least that. It is surprising, therefore, to find they have order 3. The construction is distinctly not conjunctively local, and we will postpone it to Chapter 7.

6.3.3 Higher-order Translation Spectra

If we define the 3-vector spectrum of a figure to be the set of numbers of 3-point masks satisfied in each translation-equivalence class, it is interesting to note the following fact (which is about geometry, and not about linear separation).

Theorem 6.3.3: Figures are uniquely characterized (up to translation) by their 3-vector spectra, even in higher dimensions.

Figure 6.3

[Handwritten note by the authors, partly legible: The proof shows the longest vectors that can be inscribed in a figure are unique in each direction: no other vector can be parallel and equal to a longest vector.]

Proof: Let $X$ be a particular figure.
The figure $X$ has a maximal distance $D$ between two of its points. Choose a pair $(a,b)$ of points of $X$ with this distance and consider the set

$\Phi_{ab} = \{\varphi_{a,b,x}\}$

of masks of support 3 that contain $a$, $b$ and any other point $x$ of $X$. Each such mask must have coefficient equal to unity in the translation spectrum of $X$, for if $X$ contained two translation-equivalent masks $\varphi_{a,b,x}$ and $\varphi_{ga,gb,gx}$, then one of the distances $[a, gb]$ or $[ga, b]$ would exceed $D$, for they are diagonals of a parallelogram with one side equal to $D$ (see Figure 6.3). Thus any translation of $X$ must contain a unique translation of $(a,b)$ and the part of its spectrum corresponding to $\Phi_{ab}$ allows one to reconstruct $X$ completely (see Figure 6.4).

Figure 6.4

The fact that a figure is determined by its 3-translation spectrum does not, of course, imply that recognition of classes of figures is order 3. (It does imply that the translations of two different figures can be so separated. In fact, the method of §7.3, Application 7, shows this can be done with order 2, but only outside the bounded-coefficient restriction.)

6.4 Patterns of Order 4 and Higher

We can use the fact that any three points determine a circle to make an order-4 perceptron for the predicate

$\lceil X \text{ is the perimeter of a complete circle} \rceil$

by using the form

$\sum_{d \notin C_{abc}} x_a x_b x_c x_d + \sum_{d \in C_{abc}} x_a x_b x_c \bar{x}_d < 1$,

where $C_{abc}$ is the circle* through $x_a$, $x_b$, and $x_c$. Many other curious and interesting predicates can be shown by similar arguments to have small orders. One should be careful not to conclude that there are practical consequences of this, unless one is prepared to face the facts that

1. Large numbers of $\varphi$'s may be required, of the order of [expression lost in this copy] for the examples given above.

2. The threshold conditions are sharp, so that engineering considerations may cause difficulties in realizing the linear summation, especially if there is any problem of noise.
Even with simple square-root noise, for $k = 3$ or larger the noise grows faster than the retinal size. The coefficient sizes are often fatally large, as shown in Chapter 10.

3. A very slight change in the pattern definition† often destroys its order of recognizability. With low orders, it may not be possible to define tolerances for reasonable performance.

* Again there is a tolerance problem: what is a circle in the discrete retina? See §8.3.

† Our formula accepts (appropriately) zero- and one-dimensional “circles.” This phenomenon cannot be avoided, in any dimension, by a conjunctively local predicate.

6.5 Spectral Recognition Theorems

A number of the preceding examples are special cases of the following theorems. (The ideas introduced here are not used later.) The group-invariance theorem (§2.3) shows that if a predicate $\psi$ is invariant with respect to a group $G$, then if $\psi \in L(\Phi)$ for some $\Phi$ it can be realized by a form

$\psi = \lceil \sum_i \alpha_i N_i(X) > \theta \rceil$

where $N_i$ is the number of $\varphi$'s satisfied by $X$ in the $i$th equivalence class. In §6.2 we touched on the “difference vector spectrum” for geometric figures under the group of translations of the plane. Those spectra are in fact the numbers $N_i(X)$ for order 0, 1, and 2. If a $G$-invariant $\psi$ cannot be described by any condition on the $N_i$'s for a given $\Phi$, then obviously $\psi$ is not in $L(\Phi)$. The following results show some conditions on the $N_i$ that imply that $\psi$ is of finite order.

Suppose that $\psi$ is defined by $m$ simultaneous equalities:

$\psi(X) = \lceil N_1(X) = n_1 \text{ and } N_2(X) = n_2 \text{ and } \ldots \text{ and } N_m(X) = n_m \rceil$,

where $n_1, \ldots, n_m$ is a finite sequence of integers. The order of $\psi$ is not more than twice the maximum of the orders of the $\varphi$'s associated with the $N_i$'s. We will state this more precisely as

Theorem 6.5: Let $\Phi = \Phi_1 \cup \Phi_2 \cup \cdots \cup \Phi_m$ and

$N_i(X) = |\{\varphi \mid \varphi \in \Phi_i \text{ and } \varphi(X) = 1\}| = \sum_{\varphi \in \Phi_i} \varphi(X)$.

Then the order of

$\psi(X) = \lceil N_i(X) = n_i, \text{ for } 1 \le i \le m \rceil$

is at most twice $\max\{|S(\varphi)| ; \varphi \in \Phi\}$.
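Theorem 6.5 can be checked by brute force on a small retina. The sketch below is only an illustration under stated assumptions: the retina is a line of six points, the two equivalence classes are the single-point masks and the adjacent-pair masks, and all function names are hypothetical. It expands $\sum_i (N_i - n_i)^2$ into conjunctions of pairs of masks from the same class, single masks weighted $-2n_i$, and the constant $\sum_i n_i^2$, and records the largest support that any conjunction needs.

```python
from itertools import product


def satisfied(mask, figure):
    """A mask (a set of points) is satisfied when the figure contains it."""
    return int(mask <= figure)


def example_classes(n):
    """Two hypothetical equivalence classes on a line of n points."""
    points = [frozenset([i]) for i in range(n)]
    adjacent = [frozenset([i, i + 1]) for i in range(n - 1)]
    return [points, adjacent]


def spectrum(classes, figure):
    """N_i(X): number of masks of class i satisfied by the figure."""
    return [sum(satisfied(m, figure) for m in cls) for cls in classes]


def squared_deviation_form(classes, targets, figure):
    """Evaluate sum_i (N_i - n_i)^2 as a linear form over mask conjunctions.

    Returns the value of the form and the largest support size used.
    """
    total = 0
    max_support = 0
    for cls, n in zip(classes, targets):
        # Ordered pairs within one class: the conjunction of two masks is
        # the mask on the union of their supports, with coefficient +1.
        for a, b in product(cls, repeat=2):
            conj = a | b
            max_support = max(max_support, len(conj))
            total += satisfied(conj, figure)
        # Single masks carry coefficient -2 * n_i.
        for a in cls:
            total -= 2 * n * satisfied(a, figure)
        # The constant predicate carries n_i squared.
        total += n * n
    return total, max_support
```

For example, with targets `(3, 1)` the predicate `form < 1` accepts exactly the figures having three points and one adjacent pair, and no conjunction needs support larger than 4, which is twice the largest support in the original classes.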
The goal of the proof is to show that the definition of $\psi$ can be put in the form of a linear threshold expression, namely,

$\psi(X) = \lceil \sum_i (N_i(X) - n_i)^2 < 1 \rceil$.

As it stands this is not a linear threshold combination of predicates. To recast it into the desired shape we introduce an ad hoc convention that will not be used elsewhere. Given any set $\Phi$ of predicates we construct a new set of predicates $\Phi^2$ by listing all pairs $(\varphi_i, \varphi_j)$ of predicates in $\Phi$ and defining

$\varphi_{ij}(X) = \varphi_i(X) \wedge \varphi_j(X)$.

Many of the predicates so constructed will be logically equivalent, for example, $\varphi_{ij} = \varphi_{ji}$, but we make the convention that these are to be counted as distinct members of $\Phi^2$. (This means that in a very strict sense $\Phi^2$ is a set of “predicate forms” rather than of predicates.)

The effect of the convention is to simplify the arithmetic and logic of the counting argument we are about to use. Let $X$ be a figure for which exactly $N$ predicates in $\Phi$ are satisfied. Obviously $N^2$ predicates of $\Phi^2$ will be satisfied by $X$, that is,

$\sum_{\varphi \in \Phi^2} \varphi(X) = N^2$.

Now let $\Phi_1, \Phi_2, \ldots, \Phi_m$ be an enumeration of the equivalence classes of $\Phi$. Since the number of predicates of $\Phi_i$ satisfied by $X$ is

$N_i(X) = \sum_{\varphi \in \Phi_i} \varphi(X)$,

then, as we have seen,

$\sum_{\varphi \in \Phi_i^2} \varphi(X) = N_i^2(X)$.

Thus

$\sum_i \left( \sum_{\varphi \in \Phi_i^2} \varphi(X) - 2 n_i \sum_{\varphi \in \Phi_i} \varphi(X) + n_i^2 \right) = \sum_i (N_i(X) - n_i)^2$.

To represent the left-hand side of this equation in the standard linear threshold predicate form we define $\Phi' = \Phi^2 \cup \Phi \cup \{\text{the constant predicate}\}$, and write

$\psi(X) = \lceil \sum_{\varphi \in \Phi'} \alpha(\varphi) \varphi(X) < 1 \rceil$

where

$\alpha(\varphi) = 1$ for $\varphi \in \Phi_i^2$ (pairs drawn from different classes do not appear in the sum above, so they receive coefficient 0),
$\alpha(\varphi) = -2 n_i$ for $\varphi \in \Phi_i$,
$\alpha(\text{constant}) = \sum_i n_i^2$.

To complete the proof of the theorem we have only to observe that

$|S(\varphi_{ij})| = |S(\varphi_i) \cup S(\varphi_j)| \le |S(\varphi_i)| + |S(\varphi_j)| \le 2 (\max |S(\varphi)|)$. Q.E.D.

6.5.1 Extended Exact Matching

An obvious generalization of Theorem 6.5 is this: Suppose that $\psi$ is defined by

$\psi(X) = \bigvee_{i=1}^{n} \bigwedge_{j=1}^{m} (N_j(X) = n_{ij})$,

that is, $\psi$ is satisfied by any one of a number of exact conditions on the $N_j$.
Then $\psi$ is of finite order, for we can realize the polynomial form

$\prod_{i=1}^{n} \sum_{j=1}^{m} (N_j - n_{ij})^2$

by methods like those in the previous paragraph. However, the extension now requires Boolean products of predicates of different equivalence classes, and the maximal order required will be $\le 2n \cdot \max |S(\varphi)|$.

Note that if one were not aware of the And/Or phenomenon, one might be tempted to try to obtain §6.5.1 from §6.5 via the false conjecture

$\bigvee_{i=1}^{n} (\text{predicates of order } k)$ is of order $\le nk$.

6.5.2 Mean-Square Variation

In the expressions for the predicates discussed in §6.5.1, we could increase $\theta$ to higher values:

$\lceil \sum_i (N_i - n_i)^2 < \theta \rceil$.

Then the system will accept exactly those figures for which the sum of the squares of the differences of the $N$'s and the $n$'s is bounded by $\theta$. Any pattern-classification machine will be sensitive to certain kinds of distortion, and this observation hints that it might be useful to study such machines, and perceptrons in particular, in terms of their spectrum-distortion characteristics. Unfortunately, we don't have any good ideas concerning the geometric meaning of such distortions. The geometric nature of this sort of “invariant noise” is an interesting subject for speculation, but we have not investigated it.

6.6 Figures in Context

For practical and theoretical reasons it is interesting to study the recognition of figures “in context,” like

$\psi(X) = \lceil \text{a subset of } X \text{ is a square} \rceil$,

$\psi(X) = \lceil \text{a connected component of } X \text{ is a square} \rceil$,

or, to begin to consider three-dimensional projection problems,

$\psi(X) = \lceil X \text{ contains a significant portion of the outline of a partially obscured square} \rceil$.

The examples show that there is more than one natural meaning one could give to the intuitive project of recognizing instances of patterns embedded in contexts. We do not know any general definition that might cover all natural senses, and are therefore unable to state general theorems.
We do, nevertheless, claim that the general rule is for low-order predicates to lose their property of finite order when embedded in context in any natural way. To illustrate the thesis we shall pick a particularly common and apparently harmless interpretation: For any predicate $\psi(X)$ define a new one by

$\psi_{\text{IN CONTEXT}}(X) = \lceil \psi(Y) \text{ for some connected component } Y \text{ of } X \rceil$.

It will be obvious that the techniques we use can be adapted trivially to many other definitions.

Intuitively, we would expect $\psi_{\text{IN CONTEXT}}$ to be much harder for a perceptron, since the context of each component acts as noise and the parallel operation of the device allows little chance for this to be separated and ignored. The point appears particularly clearly in the cases where $\psi$ uses rejection rules. These cannot be transferred over to $\psi_{\text{IN CONTEXT}}$ for very obvious reasons. Similarly, we will lose the stratification methods of Chapter 7 and, indeed, most of our technical tricks used to obtain low-order representations of predicates.

The next two theorems show how this intuitive idea can be given a rigorous form. It should, however, be observed that no simple generalization is possible about the relation of $\psi$ to $\psi_{\text{IN CONTEXT}}$, since some $\psi$'s become degenerate in context. For example, $\psi_{\text{CONNECTED}}$ becomes degenerate in context because every set has a connected component!

Theorem 6.6.1: Let $R$ be a finite square retina and let $\psi(X)$ be

$\psi(X) = \lceil X \text{ is exactly one horizontal line across the retina} \rceil$.

Then $\psi$ is of order 2 but $\psi_{\text{IN CONTEXT}}$ is not of finite order.

Proof: We leave as an exercise the proof that $\psi$ as defined has order 2. To show that $\psi_{\text{IN CONTEXT}}$ is not of finite order we merely observe that it is the negation of the negative of the one-in-a-box predicate, $\psi_1$, namely the predicate that asserts there is no horizontal white line across the retina. Its negative (in the photographic sense) asserts that there is no horizontal black line.
Now $\psi_1$ is not of finite order, and one can show in general that the same is true of any such predicate's negative. Finally, by reversing the predicate's inequality we find the same is true for the desired

$\psi_{\text{IN CONTEXT}} = \lceil X \text{ contains a horizontal line across the retina} \rceil$.

Theorem 6.6.2: Let $\psi(X)$ be $\lceil X \text{ is a hollow square} \rceil$. Then $\psi_{\text{IN CONTEXT}}$ is not of finite order:

$\lceil \text{One component of } X \text{ is a hollow square} \rceil$.

[Figure not reproduced.]

Proof: The proof is exactly the same as the previous except that the “boxes” or horizontal lines are folded into squares and mapped without overlap into a larger retina. Again, it can be shown that $\psi$ itself is of finite order; in this case, order 3.

Note: An alternative proof method is to fold the lines of switching elements used in the Huffman construction for connectivity (§5.5).

It is our conviction that the deterioration of the perceptron's ability to recognize patterns embedded in other contexts is a serious deterrent to using it in real, practical situations. Of course this deficiency can be mitigated by embedding the perceptron in a more serial process, one in which the figure of interest is isolated and separated from its context in an earlier phase. But this presupposes enough recognition ability, in the “preprocessing” phase, to discern and remove the most commonly encountered contextual disturbances, and this may be much harder than the “processing” phase. We treat this further in Chapter 13.

7 Stratification and Normalization

7.1 Equivalence of Figures

In previous chapters we discussed the recognition of patterns: classes of figures closed under the transformations of some group. We now turn to the related question of recognizing the equivalence, under a group, of an arbitrary pair of figures. The results below were surprising to us, for we had supposed that such problems were not generally of finite order.
A number of questions remain open, and the superficially positive character of the following constructions is clouded by the apparently enormous coefficients they require, and the manner in which they increase with the size of the retina.

A typical problem has this form: The retina* is presented as two equal parts $A$ and $B$ and we ask: is the figure in part $B$ a rigid translation of the figure in part $A$? More generally, is there an element $g$ from some given group $G$ of transformations for which the figure in $B$ is the result of $g$ operating on the figure in $A$? What order predicates are required to make such distinctions?

The results of this chapter all derive from use of a technique we call stratification. Stratification makes it possible, under certain conditions, to simulate a sequential process by a parallel process, in which the results are so weighted that, if certain conditions are satisfied, some computations will numerically outweigh effects of others. The technique derives from the following theorem.

* All the theorems of this chapter apply directly to perceptrons on infinite retinas; that is, without having to consider limiting processes on sequences of finite retinas as proposed in §1.6. The transformation groups, too, are infinite, and the group-invariance theorem is not used.

Because this material is somewhat more specialized than the rest, we will regress a little toward the conventional and hideous style of mathematical exposition, in which theorems are stated and proved before explaining what they are for.

[Handwritten note by the authors, only partly legible in this copy. It appears to say that the basic idea is simple and that a numerical example has been added to illustrate the technique; it ends with a reference to §7.3.]

7.2 The Stratification Theorem

Let $\Pi = \{\pi_1, \pi_2, \ldots, \pi_j, \ldots\}$ be a sequence of different masks and define a sequence $C_1, \ldots, C_j, \ldots$ of classes by

$X \in C_j \iff \lceil \pi_j(X) \text{ and } (k > j \implies \neg\pi_k(X)) \rceil$.

Thus $X$ is in $C_j$ if $j$ is the highest index for which $\pi_j(X)$ is true, as is illustrated below.

Figure 7.1 The partition into $C_j(X)$. [Figure not reproduced.]

Let $\Phi = \{\varphi_i\}$ be a family of predicates and let $\psi_1, \ldots, \psi_j, \ldots$ be an ordered sequence of predicates in $L(\Phi)$ that are each bounded in the sense that for each $\psi_j$ there is a linear form $\Sigma_j$ with integer coefficients such that

$\Sigma_j = \sum_i \alpha_{ij} \varphi_i - \theta_j$ and $\psi_j = \lceil \Sigma_j > 0 \rceil$

and a bound $B_j$ such that $|\Sigma_j(X)| < B_j$ for all finite $|X|$. (The proof actually requires only that each $|\Sigma_j(X)|$ be bounded on each $C_k$.)

Theorem 7.2: The predicate $\psi(X) = \lceil X \in C_j \implies \psi_j(X) \rceil$, obtained by taking on each $C_j$ the values of the corresponding $\psi_j$, lies in $L(\Phi \cdot \Pi)$; that is, it can be written as a form

$\psi(X) = \lceil \sum \alpha_{jk} (\pi_j \wedge \varphi_k) > \theta \rceil$.

Proof: It is easy to see that every finite $X$ will lie in one of the $C_j$. Define $S_1 = \pi_1 \cdot \Sigma_1$, and for $j > 1$ define inductively

$M_j = \max |S_{j-1}|$,
$S_j = S_{j-1} - \pi_j M_j + (2 M_j + 1) \cdot \pi_j \cdot \Sigma_j$.

The bounds $B_j$ assure the existence of the $M_j$'s. Now write the formal sum generated by this infinite process as

$S = \sum \alpha_{jk} (\pi_j \wedge \varphi_k)$,

and we will show that $\psi(X) = \lceil S(X) > 0 \rceil$. The infinite sum is well-defined because for any finite $X$ in any $C_j$ there will be only a finite number of nonzero $\pi_j \wedge \varphi_k$ terms.

Base: It is clear that if $X$ is in $C_1$ then $S_1 = \Sigma_1$, so $\psi(X) = \lceil S_1(X) > 0 \rceil$.

Induction: Assume that if $X$ is in $C_{j-1}$ then $\psi(X) = \lceil S_{j-1}(X) > 0 \rceil$. Now the coefficients are integers, so if $X \in C_j$, then $\pi_j = 1$ and

if $\psi(X)$ then $\Sigma_j \ge 1$, so $S_j \ge [-M_j - M_j + 2 M_j + 1] = 1$;
if $\neg\psi(X)$ then $\Sigma_j \le 0$, so $S_j \le [M_j - M_j] = 0$.

Q.E.D.

Corollary 7.2: The order of $\psi(X)$ is no larger than the sum of the maximum |support| in $\Phi$ and the maximum |support| in $\Pi$. This follows because the predicates in $\Phi$ occur only as conjuncts with predicates in $\Pi$.

The idea is that the domain of $\psi(X)$ is divided into the disjoint classes or “strata,” $C_j$. Within each stratum the $-\pi_j M_j$ term is so large that it negates all decisions made on lower strata, unless the $\psi_j$ test is passed. In all the applications below, the strata represent, more or less, the different possible deviations of a figure from a “normal” position. Hence there is a close connection between the possibility of constructing “stratified” predicates, and the conventional “pattern recognition” concept of identifying a figure first by normalizing it and then by comparing the normalized image with a prototype. This, of course, is usually a serial process.

It should be noted that predicates obtained by the use of this theorem will have enormous coefficients, growing exponentially or faster with the stratification index $j$. Thus the results of this chapter should not be considered of practical interest. They are more of theoretical interest in showing something about the relation of the structure of the transformation groups to the orders of certain predicates invariant under those groups.

7.3 Application 1: Symmetry along a Line

Let $R = \ldots, x_s, \ldots$ be the points of an infinite linear retina, that is, $-\infty < s < \infty$; it is convenient to choose an arbitrary origin $x_0$ and number the squares as shown:

[Figure not reproduced: a row of squares labeled $\ldots, x_{-1}, x_0, x_1, \ldots$]

Suppose that $X$ is a figure in $R$ with finite $|X|$. We ask whether

$\psi_{\text{SYMMETRICAL}}(X) = \lceil X \text{ has a symmetry under reflection} \rceil$

is of finite order. It should be observed that the predicate would be trivially of order 2 if the center of symmetry were fixed in advance. But $\psi_{\text{SYMMETRICAL}}$ allows it to be anywhere along the infinite line.
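The weighting scheme of Theorem 7.2 can be tried out on this symmetry question by brute force. The sketch below rests on assumptions that are not in the text: the retina is cut down to a finite window of `n` points, the masks are the two-point masks $x_s x_{s+d}$ taken in order of increasing length (enough, on a finite window, to guarantee that no segment is followed by one lying within it), and the bounded form for each stratum is one minus the number of mismatched mirror positions between the end points. All names are illustrative.

```python
from itertools import product


def segments(n):
    """All end-point pairs (s, s + d), shortest first (assumed ordering)."""
    return [(s, s + d) for d in range(n) for s in range(n - d)]


def pi(seg, x):
    """Two-point mask: 1 when both end points of the segment are in the figure."""
    s, e = seg
    return x[s] * x[e]


def sigma(seg, x):
    """Bounded integer form: at least 1 if x is symmetric on [s, e], else at most 0."""
    s, e = seg
    return 1 - sum((x[s + i] - x[e - i]) ** 2 for i in range(e - s + 1))


def is_symmetric(x):
    """Direct test: the figure between its own end points is a palindrome."""
    ones = [i for i, bit in enumerate(x) if bit]
    core = x[ones[0]:ones[-1] + 1]
    return core == core[::-1]


def stratified_sums(n):
    """Build S = S_J by the recursion of Theorem 7.2 over every nonempty figure.

    Returns the final value S(X) for each figure and the list of M_j.
    """
    figures = [x for x in product((0, 1), repeat=n) if any(x)]
    S = {x: 0 for x in figures}  # S_0 = 0, so M_1 = 0 and S_1 = pi_1 * Sigma_1
    bounds = []
    for seg in segments(n):
        M = max(abs(v) for v in S.values())  # M_j = max |S_{j-1}|
        bounds.append(M)
        for x in figures:
            p = pi(seg, x)
            S[x] += -p * M + (2 * M + 1) * p * sigma(seg, x)
    return S, bounds
```

Running `stratified_sums(5)` and comparing the sign of each final sum with `is_symmetric` shows the two agree on every nonempty figure, while the recorded `M_j` values grow rapidly with the stratum index, in line with the remark above about enormous coefficients.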
We will “stratify” $\psi_{\text{SYMMETRICAL}}$ by finding sequences $\pi_1, \ldots$ and $\psi_1, \ldots$ that allow us to test for symmetry, using the following trick: the $\pi_j$'s will “find” the two “end points” of $X$ and the corresponding $\psi_j$'s will test the symmetry of a figure, assuming that it has exactly these end points.

Our goal, then, is to define the $\pi_j$'s so that each $C_j$ will be the class of figures with a certain pair of end points. To do this we need $\pi_1, \ldots$ to be an enumeration of all segments $(x_s, x_{s+d})$ for every $s$ and for every $d \ge 0$, with the property that any term $(x_s, x_{s+d})$ must follow any term $(x_{s+a}, x_{s+b})$ with $0 \le a \le b \le d$. There do indeed exist such sequences, for example:

$\pi_1 = x_0 x_0$, $\pi_2 = x_1 x_1$, $\pi_3 = x_0 x_1$, $\pi_4 = x_{-1} x_{-1}$, $\pi_5 = x_{-1} x_0$, $\pi_6 = x_{-1} x_1$, $\pi_7 = x_2 x_2$, $\pi_8 = x_1 x_2$, ...

[The original continues the enumeration in a table of index pairs, not reproduced.]

It can be seen that (1) each segment occurs eventually, and (2) no segment is ever followed by another that lies within it. Therefore, if $x_s$, $x_{s+d}$ are the extreme left and right points of $X$, then $X$ will lie in precisely the $C_j$ for that $(x_s, x_{s+d})$. Now define $\psi_j$ to be

$\psi_j = \lceil x_{s+i} = x_{s+d-i} ;\ i = 0, \ldots, d \rceil$

or, equivalently,

$\psi_j = \lceil \sum_{i=0}^{d} (x_{s+i} - x_{s+d-i})^2 < 1 \rceil$,

showing that it is a predicate of order 2 bounded by $B_j = d + 1$. So, finally, application of the stratification theorem shows that $\psi_{\text{SYMMETRICAL}}$ has order $\le 4$, since the $\psi$'s have order $\le 2$ and the $\pi$'s have support $\le 2$.

7.4 Application 2: Translation-Congruence along a Line

Let $\ldots, x_s, \ldots$ and $\ldots, y_t, \ldots$ be the points of two infinite linear retinas, that is, $-\infty < s < \infty$ and $-\infty < t < \infty$:

[Figure not reproduced.]

Let $X$ be a figure composed of a part $X_A$ in the left retina and a part $X_B$ in the right retina. We want to construct

$\psi_{\text{TRANSLATE}}(X) = \lceil \text{the (finite) pattern in } A \text{ is a translate of the pattern in } B \rceil$.
To “stratify” ψ_translate we have to find a sequence π_j that allows us to test, with appropriate ψ_j's, whether the A and B parts of X are congruent. We will do this by a method like that used in §7.2.1, but we now have to handle two segments simultaneously. That is, we need a sequence of π_j's that enumerates all quadruples in such a way that a figure lies in C_j if and only if the end points of its A and B parts are precisely the corresponding values of x_s, x_{s+d_x}, y_t, and y_{t+d_y}. There does indeed exist such a sequence (!), and one can be obtained from the π's of §7.2.1 as follows (the reader might first try to find one himself). Define π_{jk} to be the four-point mask obtained by

π_{jk}(X) = π_j(X_A) · π_k(X_B),

that is, by choosing according to j two points of A and according to k two points of B. The master sequence requires us to enumerate all π_{jk}'s under the condition that no π_{ab} can precede any π_{cd} if both a > c and b > d. A solution is

π_{11}, π_{21}, π_{12}, π_{22}, π_{31}, π_{32}, π_{13}, π_{23}, π_{33}, π_{41}, π_{42}, π_{43}, π_{14}, π_{24}, ...,

and for the π_{jk} term in this sequence, an appropriate predicate ψ_{(jk)} is

ψ_{(jk)} = [the segments defined by π_j and π_k have the same lengths, and the x's and y's in those intervals have the same values at corresponding points].

This is an order-2 predicate, and bounded (by the segment lengths). The π's now have support 4, so ψ_translate(X) has finite order ≤ 6. Actually, having found both extrema of X_A, it is necessary only to find one end of X_B, so a slightly different construction using the method of §7.9 would show that the order of ψ_translate is ≤ 5.

7.5 Application 3: Translation on the Plane

The method of application 2 can be applied to the problem of the two-dimensional translations of a bounded portion of the plane by using the following trick. Let each copy of the retina be an (m × m) array.
Arrange the squares into a sequence {x_i} with the square at (a, b) having index ma + b. In effect, we treat the retina as a cylinder and index its squares so:

[Figure: the squares of the m × m array indexed consecutively around a cylinder]

This maps each half of the retina onto a line like that of application 2 in such a way that for limited translations that do not carry the figure X over the edge of the retina, translations on the plane are equivalent to translations on the line, and an order-5 predicate can be constructed. In §7.6 we will show how the ugly restriction just imposed can be eliminated!

Application 4. 180° rotation about undetermined point on the plane. With the same kind of restriction, this predicate can be constructed (with order 4) from application 1 by the same route that derived application 3 from application 2. Similarly, we can detect reflections about undetermined vertical axes.

7.6 Repeated Stratification

In the conditions of the stratification theorem, the only restriction on the ψ_j's is that they be suitably bounded. In certain applications the ψ_j's themselves can be obtained by stratification. This is particularly easy to do when the support of ψ_j is finite, for then boundedness is immediate. To illustrate this repeated stratification we will proceed to remove the finite restriction in application 3 of §7.5. First enumerate all the points of each of two infinite plane retinas A and B according to the more or less arbitrary pattern:

[Figure 7.2]

to obtain two sequences x_1, ..., x_s, ... and y_1, ..., y_t, .... Now we will invoke precisely the same enumeration as in §7.4, but with the definition

π_{jk}(X) = [x_j ∈ X_A and y_k ∈ X_B] = x_j · y_k.

Then C_{(jk)} is the class of pairs (X_A, X_B) for which

j = max{s | x_s ∈ X_A}, k = max{t | y_t ∈ X_B}.

We need only a (bounded) ψ_{(jk)} that decides whether X_A is a translate of X_B for figures in C_{jk}.
But the figures in C_{jk} all lie within bounded portions of the planes, in fact within squares of about [max(j, k)]^{1/2} on a side around the origins. Within such a square (or better, within one of twice the size to avoid “edge-effects”) we can use the result of application 3, §7.5, to obtain a predicate ψ_{(jk)} with exactly the desired property, and with finite support! The resulting order is ≤ 5 + 2 = 7. We have another construction for this predicate of order ≤ 5. The same argument can be used to lift the restrictions in application 4 of §7.5.

7.7 Application 5: The Axis-parallel Squares in the Plane

We digress a moment to apply the method of the last section to show that the predicate

ψ(X) = [X is a solid (hollow) axis-parallel square],

where the form may lie anywhere in the infinite plane, has order ≤ 3. (We consider this remarkable because informal arguments, to the effect that two sides must be compared in length while the interior is also tested, suggest orders of at least 4. The result was discovered, and proven by another method, by our student, John White.)

We enumerate the points x_1, ... of a single plane, just as in §7.6, and simply set π_i = x_i. Then C_i is the set of figures whose “largest” point is x_i. If X is a square, the situation is like one of the cases shown in Figure 7.4. We then construct ψ_j by stratifying as follows:

[Figure 7.4]

Let x^j_1, x^j_2, ..., x^j_{n_j} be the finite sequence obtained by stepping into the spiral figure orthogonally from x_j. Define π^j_i = x^j_i so that C^j_i will contain all the squares of length i on a side that are “stopped” by x_j. But there is only one such square, call it S^j_i. So to complete the double stratification we need only provide predicates ψ^j_i to recognize the squares S^j_i.
But this can be done by

$\psi^j_i = \left[ \sum_k a_k x_k \ge i^2 \right]$

where a_k = 1 if x_k ∈ S^j_i, a_k = −1 if x_k ∉ S^j_i and k < j, and a_k = 0 otherwise. Then ψ^j_i is of order 1. So ψ has order ≤ 3! Q.E.D.

7.8 Application 6: Figures Equivalent under Translation and Dilation

Can a system of finite order recognize equivalence of two arbitrary figures under translation and size change? Some reflection about the result and methods used in §7.6 and §7.7 will suggest that we have all the ingredients, for §7.6 shows how to handle translation, and §7.7 shows how to recognize all the translations and dilations of a particular figure. Now dilation involves serious complications with tolerance and resolution limits, in so far as our theory is still based on a fixed, discrete retina, and we do not want to face this problem squarely. None the less, it is interesting that the desired property can at least be approximated with finite order, in an intuitively suggestive fashion. (We do not think that a similar approximation can be made in the case of rotation invariance, because the problem there is of a different kind, one that cannot be blamed on the discrete retina. Rather, it is because the transformations of a rotation group cannot be simply ordered, and this “blocks” stratification methods.)

Our method begins with the technique used in §7.6 to find predicates π_{(jk)} that “catch” the two figures in boxes. Then, just as in §7.6, the problem is reduced to finding predicates that need only operate within the boxes of Figure 7.2. We construct the ψ_{(jk)}'s by a brutal method: within each box we use the simple enumeration of points described in §7.5. Then we stratify four times (!) in succession with respect to

x = highest and leftmost point of A,
y = highest and leftmost point of B,
x' = lowest and rightmost point of A,
y' = lowest and rightmost point of B.

We will need to define predicates ψ^{jk}_{xx'yy'} for this.
If the two vectors x − x' and y − y' do not have the same direction we set ψ = 0; otherwise we need a ψ to test whether or not, for every vector displacement v,

$y + v = x + \frac{|x - x'|}{|y - y'|}\, v$

(read as equality of the retina values at the two points), and this is an order-2 predicate, leading finally to total order ≤ 2 + 4 + 2 = 8. Of course, on the discrete retina the indicated operations on vectors will be ill-defined, but it seems clear that the result is not vacuous: for example, we could ask for recognition of the case where X_B is a translate and an integer multiple of X_A in size, with each black square of X_A mapping into a correspondingly larger square in X_B. We have another construction for this predicate of order ≤ 6.

7.9 Application 7: Equivalents of a Particular Figure

In constructing ψ for application 5, we noted that one can always construct an order-1 predicate to detect precisely one particular figure X_0 by using, as there, a test of the form

$\left[ \sum_{x \in X_0} x - \sum_{x \notin X_0} x \ge |X_0| \right]$.

It follows that if we can construct a stratification {π_j} for a group G such that

X ∈ C_j AND gX ∈ C_j ⟹ (gX = X),

then we can recognize exactly the G-equivalents of a given figure X_0 (with one order higher than the order used by the stratification π's). This is suggestive of a machine that brings figures into a normal form in the first stage of a recognition process.

For this case our general construction method takes a very simple form: Consider a particular figure X_0 consisting of the ordered sequence of points {x_{i_1}, ..., x_{i_p}} on the half-line

[Figure: the half-line retina x_1, x_2, x_3, ...]

Let π_j(X) = [x_j ∈ X] and define ψ_j(X) as

$\psi_j(X) = \left[ \sum_k [x_{k-j+i_p} \in X_0]\,(1 - x_k) + \sum_k [x_{k-j+i_p} \notin X_0 \text{ and } k < j]\; x_k < 1 \right]$

ignoring for the moment points with negative indices. Then, except for “edge effects” we obtain a predicate of order 2 that recognizes precisely the translates of X_0.
Next we observe that there is really no difficulty in extending this to the two-way infinite line, for we can enumerate the π_j's in the order

[Figure: the squares ..., x_{-2}, x_{-1}, x_0, x_1, x_2, x_3, ... carrying the labels ..., π_4, π_2, π_1, π_3, π_5, π_7, ...]

so that if a figure ends up in class C_{2j} we will have found its leftmost point x_{−j}, and if it is in a C_{2j+1} we will have found its rightmost point x_j. In either case we can construct an appropriate ψ. Hence, finally, we see that there exists for any given figure X_0 a predicate of order 2 that recognizes precisely the linear translations of X_0, and there is no problem about boundedness because all ψ supports are finite.

7.10 Apparent Paradox

Consider the case of X_0 =

[Figure: a three-point figure on the line]

We have just shown that there exists a ψ of order 2 that accepts just the translates of this figure. Hence ψ must reject the non-equivalent figure

[Figure: a second three-point figure on the line]

But both of these figures have exactly the same n-tuple distribution spectrum (see §6.2 and §6.5) up to order 2! Each has 3 points, and each has 1 adjacent pair, 1 pair two units apart, and 1 pair 3 units apart. Therefore, if all group-equivalent φ's had the same weights, a perceptron of order ≥ 3 would be needed to distinguish them. Thus if we could apply the group-invariance theorem we would in fact obtain a proof that no perceptron of order 2 can distinguish between these. This would be a contradiction! What is wrong? The answer is that the group-invariance theorem does not in general apply to predicates invariant under infinite groups. When a group is finite, for example, the cyclic translation group of the toroidal spaces we have considered from time to time, one can always use the group-invariance theorem to make equal the coefficients of equivalent φ's. But we cannot use it together with stratification to construct the predicate on infinite groups.
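The spectrum count in §7.10 can be checked directly. The two figures themselves did not survive in this copy, so the sketch below assumes they are the point sets {0, 1, 3} and its mirror image {0, 2, 3}, which match the counts stated in the text (3 points; one pair each at distances 1, 2, and 3). The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_spectrum(points):
    """Count the pairs of points of a figure on the line by their separation."""
    return Counter(b - a for a, b in combinations(sorted(points), 2))

def is_translate(p, q):
    """True if the finite point sets p and q differ by a single shift."""
    p, q = sorted(p), sorted(q)
    return len(p) == len(q) and len({b - a for a, b in zip(p, q)}) <= 1

# Assumed figures: chosen to match the counts given in the text.
fig_a = {0, 1, 3}
fig_b = {0, 2, 3}

same_spectrum = pair_spectrum(fig_a) == pair_spectrum(fig_b)
translates = is_translate(fig_a, fig_b)
```

Under these assumptions `same_spectrum` is `True` while `translates` is `False`: the two figures agree in every count of points and pairs, yet neither is a translate of the other, which is exactly the situation the paragraph describes.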
With infinite groups we can use stratification for normalizing, but then we must face the possibility of getting unbounded coefficients within equivalent φ's; and then the group-averaging operations do not, in general, converge. This will be shown as a theorem in §10.4. We conjecture that predicates like the “twins” of §7.5 are not of finite order with bounded coefficients. In any case, it would be interesting to know whether there are such predicates.

7.11 Problems

A number of lines of investigation are intriguing: what is the relation between the possible stratifications, including repeated ones, and algebraic decompositions of the group into different kinds of subgroups? For what kinds of predicates can the group-invariance theorem be extended to infinite groups? What predicates have bounded coefficients in each equivalence class, or in each degree? Under what conditions do the “normal-form stratifications” of application 7 exist? For example, we conjecture that on circles or toroids, there is no bound on the order of predicates ψ that select unique “normal form” figures under rotation groups:

ψ(X) and ψ(gX) ⟹ X = gX.

We suspect that this may be the reason we were unable to extend the method of application 6 to the full Euclidean similarity group, including rotation.

We note that the condition in Theorem 7.2 that the predicates {π_j} be masks is probably stronger than necessary. We have not looked for a better theorem.

Stratified predicates probably are physically unrealizable because of huge coefficients. It would be valuable to have a form of Theorem 7.2 that could help establish lower bounds on the coefficients.

A stratification seems to correspond to a serial machine that operates sequentially upon the figure, with a sequence of group transformation elements, until some special event occurs, establishing its membership in C_j, and then applies a “matching test” corresponding to ψ_j.
The set of ψ_j's must contain information about the figures in all transformed positions, so the possibility of a perceptron accomplishing such a recognition should not suggest that the machine has any special generalization ability with respect to the group in question; rather, it suggests the opposite! The apparent enormity of the coefficient hierarchies casts a gloomy shadow on the practicality of learning stratified coefficients by reinforcement, since reinforcing a figure in C_j cannot work until it has depressed all discriminations in all preceding classes. This is discussed further in Chapters 10 and 11.

Example (handwritten addition by the authors; partly illegible in this copy): to recognize the translates of the pattern [figure]. Let R be the half-line x_1, x_2, x_3, ... and let ψ be the desired predicate: ψ(X) = [X has exactly 3 black squares, in the pattern shown]. We will show that ψ has order [...]. First we define a special predicate ψ_j for each instance of ψ as follows: ψ_j = [the rightmost squares of X are exactly the pattern] = [S_j > 0], where S_j is a linear form in x_1, ..., x_j [...]. Note that each S_j has order 1. Now we can express ψ(X) as something like ψ(X) = ψ_j(X), where x_j is the rightmost black square of X. But can we express the implicit selection of the correct ψ_j within the perceptron framework? Yes, by using a trick. Let M_1, M_2, ... be a sequence of numbers that grows large very rapidly [...]. Then

ψ = [M_1 x_1 S_1 + M_2 x_2 S_2 + ... + M_j x_j S_j + ... > 0]

because the term with the rightmost black x_j will outweigh all earlier terms and determine the sign of the whole sum. The entire chapter is based on various ways to exploit this simple [...] concept. The S_j terms of this example correspond to the ψ_j of the text.

8 The Diameter-Limited Perceptron

8.0 In this chapter we discuss the power and limitations of the “diameter-limited” perceptrons: those in which each φ can see only a circumscribed portion of the retina R.
We consider a machine that sums the weighted evidence about a picture obtained by experiments φ_i, each of which reports on the state of affairs within a circumscribed region of diameter less than or equal to some length D, that is, Diameter(S(φ)) ≤ D.

One gets two different theories when, in considering diameter-limited predicate-schemes, one takes D as (1) an absolute length, or (2) a fixed fraction of the size of R. Generally, it is more interesting to choose (1) for positive results. For negative results, (1) is usually a special case of an order-limited theory, and (2) gives different and sometimes stronger results. The theory did not seem deep enough to justify trying to determine in each case the best possible result. From a practical point of view one merely wants D to be small enough that none of the φ's see the whole figure (for otherwise we would have no theory at all) and large enough to see interesting features.

8.1 Positive Results

We will first consider some things that a diameter-limited perceptron can recognize, and then some of the things it cannot.

8.1.1 Uniform Picture

A diameter-limited perceptron can tell when a picture is entirely black, or entirely white: choose φ's that cover the retina in regions (that may overlap) and define φ_i to be zero if and only if all the points it can see are white. Then

Σφ_i > 0

if the picture has one or more black points, and ≤ 0 if the picture is blank. Similarly, we could define the φ's to distinguish the all-black picture from all others.

These patterns are recognizable because of their “conjunctively local” character (see §0.6): no φ-unit can really say that there is strong evidence that the figure is all white (for there is only the faintest correlation with this), but any φ can definitely say that it has conclusive evidence that the picture is not all white.
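The blank-picture test of §8.1.1 can be sketched as follows. The representation is an assumption made for illustration: a picture is a set of (row, column) pairs naming its black squares, the retina is a width × height grid, and the φ's see non-overlapping square cells of side `side` (so the diameter limit D is taken to be at least the diagonal of one cell). The function names are hypothetical.

```python
def blocks(width, height, side):
    """Partition a width x height retina into square cells of the given side."""
    for top in range(0, height, side):
        for left in range(0, width, side):
            yield [(r, c)
                   for r in range(top, min(top + side, height))
                   for c in range(left, min(left + side, width))]

def phi(cell, picture):
    """Local predicate: 0 if and only if every point it can see is white."""
    return int(any(point in picture for point in cell))

def is_not_blank(picture, width, height, side):
    """Threshold test  sum(phi_i) > 0  over all cells of the partition."""
    return sum(phi(cell, picture) for cell in blocks(width, height, side)) > 0
```

With an 8 × 8 retina and cells of side 3, `is_not_blank(set(), 8, 8, 3)` is `False` and `is_not_blank({(7, 7)}, 8, 8, 3)` is `True`: a single black point anywhere is conclusive local evidence, while no single cell can certify that the whole picture is white.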
Some interesting patterns have this character, that one can reject all pictures not in the class because each must have, somewhere or other, a local feature that is definitive and can be detected by what happens within a region of diameter D.

8.1.2 Area Cuts

We can distinguish, for any number S, the class of figures whose area is greater than S. To do this we define a φ_p for each point to be 1 if p is black, 0 otherwise. Then Σφ_p > S is a recognizer for the class in question.

8.1.3 Triangles and Rectangles

We can make a diameter-limited perceptron recognize the figures consisting of exactly one triangle (either solid or outline) by the following trick: We use two kinds of φ's: the first, φ_i, has value +1 if its field contains a vertex (two line segments meeting at an angle), otherwise its value is zero. The second kind, φ'_i, has value zero if its field is blank, or contains a line segment, solid black area, or a vertex, but has value +1 if the field contains anything else, including the end of a line segment. Provide enough of these φ's so that the entire retina is covered, in nonoverlapping fashion, by both types. (Of course, this won’t work when a vertex occurs at the edge of a φ-support. By suitable overlapping, and assignment of weights, the system can be improved, but it will always be an approximation of some sort. This applies to the definition of “line segment,” etc., as well as to that of “vertex.” See §8.3.) Finally assign weight 1 to the first type and a very large positive weight W to those of the second type. Then

Σφ_i + W Σφ'_i < 4

will be a specific recognizer for triangles. (It will, however, accept the blank picture, as well.) Similarly, by setting the φ_i's to recognize only right angles, we can discern the class of rectangles with

Σφ_i + W Σφ'_i < 5.

A few other geometric classes can be captured by such tricks, but they depend on curious accidents.
A rectangle is characterized by having four right angles, and none of the exceptions detected by the φ'_i's. In §6.3.2 we did this for axis-parallel rectangles: for others there are obviously more serious resolution and tolerance problems. But there is no way to recognize the squares, even axis-parallel, with diameter-limited φ's; the method of §7.2.5 can’t be so modified.

8.1.4 Absolute Template-matching

Suppose that one wants the machine to recognize exactly a certain figure X_0 and no other. Then the diameter-limited machine can be made to do this by partitioning the retina into regions, and in each region a φ function has a value 0 if that part of the retina is exactly matched to the corresponding part of X_0, otherwise the value is 1. Then

Σφ < 1

if and only if the picture is exactly X_0.

Note, however, that this scheme works just on a particular object in a particular position. It cannot be generalized to recognize a particular object in any position. In fact we show in the next section that even the simplest figure, that consists of just one point, cannot be recognized independently of position!

8.2 Negative Results

8.2.1 The Figure Containing One Single Black Point

This is the fundamental counterexample. We want a machine

[Σα_i φ_i > θ]

to accept figures with area 1, but reject figures with area 0 or area greater than 1. To see that this cannot be done with diameter-limited perceptrons, suppose that {φ_i}, {α_i}, and θ have been selected. Present first the blank picture X_0. Then if f(X) = Σα_i φ_i(X), we have f(X_0) ≤ θ. Now present a figure X_1 containing only one point x_1. We must then have

f(X_1) > θ.

The change in the sum must be due to a change in the values of some of the φ's. In fact, it must be due to changes only in φ's for which x_1 ∈ S(φ), since nothing else in the picture has changed. In any case,

f(X_1) − f(X_0) > 0.

Now choose another point x_2 which is further than D away from x_1.
Then no S(φ) can contain both x_1 and x_2. For the figure X_2 containing only x_2 we must also have

f(X_2) = Σα_i φ_i > θ.

Now consider the figure X_12 containing both x_1 and x_2. The addition of the point x_1 to X_2 can affect only φ's for which x_1 ∈ S(φ), and these are changed exactly as they are changed when the all-blank picture X_0 is changed to the picture X_1. Therefore

f(X_12) = f(X_2) + [f(X_1) − f(X_0)].

But then the two previous inequalities yield

f(X_12) > θ,

which contradicts the requirement that

f(X_12) ≤ θ.

Of course, this is the same phenomenon noted in §0.3 and §2.1. And it gives the method for proof of the last statement in §8.1.3.

8.2.2 Area Segments

The diameter-limited perceptron cannot recognize the class of figures whose areas A lie between two bounds A_1 ≤ A ≤ A_2.

proof: this follows from the method of §8.2.1, which is a special case of this, with A_1 = 1 and A_2 = 1. We recall that this recognition is possible with order 2 if the diameter limitation is relaxed using the method of §1.4, example 7.

8.2.3 Connectedness

The diameter-limited perceptron cannot decide when the picture is a single, connected whole, as distinguished from two or more disconnected pieces. At this point the reader will have no difficulty in seeing the formal correctness of the proof we gave of this in §0.8.

8.3 Diameter-limited Integral Invariants

We observed in §6.3.1 that convexity has order 3, but that the construction used there would not carry over to the diameter-limited case, because it would not reject a figure with two widely separated convex components. On the other hand, §8.1.3 shows how a diameter-limited predicate can capture some particular convex figures. The latter construction generalizes, but leads into serious problems about tolerance and into questions about differentials. Suppose that we define a diameter-limited family of predicates Φ_ε using the following idea: Choose an ε > 0.
Cover R with a partition of small cells C_j. For each integer k define φ_{jk} to be 1 if C_j ∩ X contains an “edge” with change-in-direction greater than kε and otherwise φ_{jk} = 0. Now consider the “integral”

$\epsilon \sum_{jk} \varphi_{jk}$.

The contribution to the sum of each segment of curve will be ε · c/ε = c, where c is the magnitude of the change in direction of the segment; hence the total sum is the “total curvature.” Finally we claim that we can realize ψ_convex as

$\left[ \epsilon \sum_{jk} \varphi_{jk} \le 2\pi \right]$,

because the total curvature of any figure must be ≥ 2π and only (and all) convex figures achieve the equality. We ignore figures that reach the edge of the retina and such matters.

A similar argument can be used to construct a predicate that uses the signed curvature to realize functions of the Euler characteristic of the form [G(X) < n], since that invariant is just the total signed curvature divided by 2π. Of course on the quantized plane the diameter-limited predicate of §5.8.1 does this more simply.

One could go on to describe more sophisticated predicates that classify figures by properties of their “differential spectra.” However, we do not pursue this because these observations already raise a number of serious questions about tolerances and approximations. There are problems about the uniformity of the coverings, the sizes of ε and the diameter-limited cells C_j, and problems about the cumulative errors in summing small approximate quantities. Certainly within the E² → R square map described in Chapter 5, or anything like it, all such predicates will give peculiar results whenever the diameter cells are not large compared to the underlying mesh, or small compared to the relevant features of the X's. The analysis, in §9.3, of ψ_convex attempts to face this problem. For example, we can regard the recognition of rectangles, as done in §6.3.2, as a pure artifact in this context, because it so depends on the mesh.
The description in §8.1.3 of another form of the same predicate is worded in such a way that one could make reasonable approximations, within reasonable size ranges.

8.4 Proof of Uniqueness of the Eulerian Invariants for Diameter-Limited Perceptrons

In this section we show, as promised at the end of Chapter 5, that

Theorem 8.4: Diameter-limited perceptrons cannot recognize any nontrivial topological properties except the Eulerian predicates [E(X) > n] and [E(X) < n].

proof: The argument of §5.8 shows that ψ(X) must be a function of E(X). This is immediate for the absolute diameter-limit, which is a special case of order-limit. The argument, with suitable modifications, carries over to relative diameter-limits. Now consider two figures A and B that differ only in a single interior square:

[Figure: the figures A and B, with a circle around the square in which they differ. A handwritten marginal note, largely illegible in this copy, describes another sense of “local” that fuses the diameter limit and the order limit.]

where the circle shows the range of the diameter limit. Then suppose that

ψ(X) = [Σα_φ φ(X) > θ]

and consider the difference

Δ = Σα_φ φ(B) − Σα_φ φ(A).

Now if it happens that Δ ≥ 0 then ψ(B) ≥ ψ(A), hence removing a hole cannot decrease ψ. By topological equivalence, adding a component has the same effect upon E(X) and hence upon ψ(X). Thus, if Δ ≥ 0, then

E(B) ≥ E(A) ⟹ ψ(B) ≥ ψ(A),

and similarly, if Δ ≤ 0 then

E(B) ≥ E(A) ⟹ ψ(B) ≤ ψ(A).

It follows that in each case there must be an n such that

(if Δ ≥ 0) ψ(X) = [E(X) > n] or (if Δ ≤ 0) ψ(X) = [E(X) < n]

or else ψ is a constant.
The trivial exceptions are the constant predicates and the uniform predicates of §8.1.1, which are exceptions to the canonical form of §5.8.

[Handwritten note by the authors on the “differential order” idea; illegible in this copy.]

9 Geometric Predicates and Serial Algorithms

9.0 Connectedness and Serial Computation

It seems intuitively clear that the abstract quality of connectedness cannot be captured by a perceptron of finite order because of its inherently serial character: one cannot conclude that a figure is connected by any simple order-independent combination of simple tests. The same is true for the much simpler property of parity. In the case of parity, there is a stark contrast between our “worst possible” result for finite-order machines (§3.1, §10.1) and the following “best possible” result for the serial computation of parity. Let x_1, x_2, ..., x_n be any enumeration of the points of R and consider the following algorithm for determining the parity of |X|:

start: Set i to 0.
even: Add 1 to i. If i = |R| then stop; ψ_PARITY = 0. If x_i = 0, go to even; otherwise go to odd.
odd: Add 1 to i. If i = |R| then stop; ψ_PARITY = 1. If x_i = 0, go to odd; otherwise go to even.

Now this program is “minimal” in two respects: first in the number of computation-steps per point, but more significant, in the fact that the program requires no temporary storage place for partial information accumulated during the computation, other than that required for the enumeration variable i.
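The two-state parity scan above can be sketched in Python. This is an illustration, not a transcription: the retina is assumed to be given as a sequence of values (0 for white, anything else for black), the loop is written so that every point is read exactly once, and the name `parity` is hypothetical. The only information carried from point to point is which of the two labels, even or odd, the scan is currently at.

```python
def parity(retina):
    """Serial parity of the number of black points.

    `retina` is the sequence x_1, ..., x_n (0 = white, nonzero = black).
    The variable `state` plays the role of the two program labels.
    """
    state = "even"
    for x in retina:
        if x != 0:
            # A black point moves the scan to the other label.
            state = "odd" if state == "even" else "even"
    return 0 if state == "even" else 1
```

For instance, `parity([0, 1, 1, 0, 1])` returns `1` and `parity([1, 1])` returns `0`. Apart from the loop position, which corresponds to the enumeration variable i, the one bit held in `state` is the “one binary digit of current information” that the text says can be absorbed into the structure of the algorithm.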
[In a sense, the process requires one binary digit of current information, but this can be absorbed (as above) into the algorithm structure.]

This suggests that it might be illuminating to ask for connectedness: how much memory is required by the best serial algorithm? The answer, as shown below, is that it requires no more than about 2 times that for storing the enumeration variable alone! To study this problem it seems that the Turing-machine framework is the simplest and most natural because of its uniform way of handling information storage.

9.1 A Serial Algorithm for Connectedness

Connectedness of a geometric figure X is characterized by the fact that between any pair (p, q) of points of X there is a path that lies entirely in X. An equivalent definition, using any enumeration x_1, ..., x_{|R|} of the points of R is: X is connected when each point x_i in X, except the first point in X, has a path to some x_j in X for which i > j. (Proof: By recursion, then, each point of X is connected to the first point in X.) Using this definition of connectedness we can describe a beautiful algorithm to test whether X is connected. We will consider only figures that are “reasonably regular”; to be precise, we suppose that for each point x_i on a boundary there is defined a unique “next point” x_{i*} on that boundary. We choose x_{i*} to be the boundary point to one’s right when standing on x_i and facing the complement of X. We will also assume that points x_i and x_{i+1} that are consecutive in the enumeration are adjacent except at the edges of R. Finally, we will assume that X does not touch the edges of the space R. (A handwritten addition, partly illegible in this copy, adds a further assumption about figures that narrow to just one square.)

start: Set i to 0 and go to search.
search: Add 1 to i. If i = |R|, stop and print “X is NULL.” If x_i ∈ X then go to scan, otherwise go to search.
scan: Add 1 to i. If i = |R|, stop and print “X is connected.” If x_{i−1} ∉ X and x_i ∈ X then set j to i and go to trace, otherwise go to scan.
trace: Set j to j*. If j = i, stop and print “X is disconnected.” If j > i, go to trace. If j < i, go to scan.

Notice that at any point in the computation, it is sufficient to keep track of the two integers i and j; we will see that no extra memory space is needed for |R|.

Analysis: search simply finds the first point of X in the enumeration of R. Once such a point of X is found, scan searches through all of R, eventually testing every point of X. The current point, x_i, of scan is tested as follows: If x_i is not in X, then no test is necessary and scan goes on to x_{i+1}. If the previous point x_{i−1} was in X (and, by induction, is presumed to have passed the test) then x_i, if in X, is connected to x_{i−1} by adjacency. Finally, if x_i ∈ X and [...] before, or (2) B is an interior boundary curve, in which case a point of B must have been encountered before reaching x_i, which is inside B, or (3) B is the exterior boundary curve of a never-before-encountered component of X, the only case in which trace will return to x_i without meeting an x_j for which j < i. Thus scan will run up to i = |R| if and only if X has a single nonempty connected component (see Figure 9.1).

9.2 The Turing-Machine Version of the Connectedness Algorithm

It is convenient to assume that R is a 2^n × 2^n square array. Let x_1, ..., x_{|R|} be an enumeration of the points of R in the order

1, 2, ..., 2^n
2^n + 1, 2^n + 2, ..., 2^n + 2^n
...
(2^n − 1)2^n + 1, (2^n − 1)2^n + 2, ..., (2^n − 1)2^n + 2^n.

This choice of dimension and enumeration makes available a simple way to represent the situation to a Turing machine. The Turing machine must be able to specify a point x_i of R, find whether x_i ∈ X, and in case x_i is a boundary point of X, find the index i* of the “right neighbor” of x_i.
The Turing-machine tape will have the form

$$I_y \;..n..\; I_x \;..n..\; J_y \;..n..\; J_x \;..n..$$

where “..n..” denotes an interval of n blank squares. Then the intervals to the right of $I_x$ and $I_y$ can hold the x and y coordinates of a point of R. We will suppose that the Turing machine is coupled with the outside world, that is, the figure X, through an “oracle” that works as follows: certain internal states of the machine have the property that when entered, the resulting next state depends on whether the coordinates in the I (or J) intervals designate a point in X. It can be verified, though the details are tedious, that all the operations described in the algorithm can be performed by a fixed Turing machine that uses no tape squares other than those in the “..n..” intervals. For example, “$i = |R|$” if and only if there are all zeros in the “..n..”’s following $I_x$ and $I_y$. “Add 1 to i” is equivalent to “start at $J_y$ and move left, changing 1’s to 0’s until a 0 is encountered (and changed to 1) or until $I_y$ is met.” The only nontrivial operation is computing $j^*$ given j. But this requires only examining the neighbors of $x_j$, and that is done by adding ±1 to the $J_x$ and $J_y$ coordinates, and consulting the oracle.

Since the Turing machine can keep track of which “..n..” interval it is in, we really need only one symbol for punctuation, so the Turing machine can be a 3-symbol machine. By using a block encoding, one can use a 2-symbol machine, and, omitting details, we obtain the result:

Theorem 9.2: For any $\epsilon$ there is a 2-symbol Turing machine that can verify the connectedness of a figure X on any rectangular array R, using less than $(2 + \epsilon) \log_2 |R|$ squares of tape!

We are fairly sure that the connectedness algorithm is minimal in its use of tape, but we have no proof. (In fact, we are very weak in methods to show that an algorithm is minimal in storage; this is discussed in Chapter 12.)
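The two register operations named above, “Add 1 to i” and the test “$i = |R|$”, can be sketched in a few lines. This is only an illustration under stated assumptions: the two n-digit intervals are modeled as one Python list of 0/1 digits with the most significant digit first, the punctuation symbols are left out, and the names `add_one` and `is_end_of_retina` are hypothetical.

```python
def add_one(bits):
    """Increment a binary register in place, scanning from the right end.

    Ones are changed to zeros until a zero is met, which is changed to one.
    If the left end is reached first, the register has wrapped to all zeros.
    """
    pos = len(bits) - 1
    while pos >= 0 and bits[pos] == 1:
        bits[pos] = 0
        pos -= 1
    if pos >= 0:
        bits[pos] = 1
    return bits


def is_end_of_retina(bits):
    """The test "i = |R|": true exactly when every digit is zero."""
    return not any(bits)


def count_steps(n):
    """Count increments from the all-zero register until it is all zeros again."""
    register = [0] * (2 * n)   # the two n-digit intervals, laid end to end
    steps = 0
    while True:
        add_one(register)
        steps += 1
        if is_end_of_retina(register):
            return steps
```

With 2n digits the register returns to all zeros after exactly $2^{2n} = |R|$ increments, which is why no separate storage for $|R|$ is needed.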
Incidentally, it is not hard to show that $\lceil |X| \text{ is prime} \rceil$ requires no more than $(2 + \epsilon) \log_2 |R|$ squares (and presumably needs more than $(2 - \epsilon) \log_2 |R|$). We have little definite knowledge about geometric predicates that require higher orders of storage, but we suspect that, in an appropriate sense, recognizing the topological equivalence of two figures (for example, two components of X) requires something more like $|R|$ than like $\log |R|$ squares. There are, of course, recursive function-theoretic predicates that require arbitrarily large amounts of storage, but none of these are known to have straightforward geometric interpretations.

9.2.1 Pebble Automata

A variant of this computation model has been studied by M. Blum and C. Hewitt. The Turing machine is replaced by a finite-state automaton which moves about on the retina, reading the color of the cell on which it is currently located. As a function of this input and its current state, the automaton determines its next state and one of four possible moves: north, east, south, west. A properly designed automaton should operate on any retina, however large, provided that it is given a way to detect the edge of the array. This is a convenient way to realize the idea of a predicate-scheme. The position of the automaton on the retina plays the role of one of the two point indices I and J remembered by the Turing machine. To give the machine the effect of the second point index, it can be provided with a pebble that can be left anywhere on the retina and retrieved later. We leave to readers the extremely tricky exercise of translating the Turing machine algorithm into a form suitable for an automaton with one pebble. Can connectedness be recognized without using the pebble? Surely not, but we have not proved it!

9.3 Memory-Tape Requirements for $\psi_{\mathrm{CONVEX}}$

For convexity we can also get a bound on the tape memory.
However, since convexity is a metric property, one must face the problem of precision of measurement vis-a-vis the resolution of the finite lattice of R. It seems reasonable to ask that the figure have no indentations larger than the order of the size of a lattice square. One way to verify this is to check, for each pair (a, b) of boundary points, that there is no such indentation.

To make this test seems to require the equivalent of scanning the squares close to the ideal line from a to b, and some memory is required to stay sufficiently close to its slope. For each increment in (say) y one must compute for x the largest integer in the corresponding quotient, and the remainder, with its $\log_2 n$ digits, must be saved for the iterative computation. Thus the computation can be done if one stores $\log_2 n$ digits for each of a, b, x, y, and r, where r(y) is the remainder of $(b - a)\,y / n$, which can be obtained from a register containing x and r by adding $b - a$ at each step, any overflow from r being carried into x.

Thus one can test for convexity by using the order of $\tfrac{5}{2} \log_2 |R|$ squares. There is an evident redundancy here since (for example) a can be reconstructed from the other four quantities, and this suggests that with some ingenuity one could get by with just $(2 + \epsilon) \log_2 |R|$. In any case we have no idea of how to establish a lower bound. Although convexity is simpler than connectivity in that it is conjunctively local, this is no particular advantage to the Turing machine, which is well suited for recursive computation, and this simplicity is quite possibly balanced by the complication of the metric calculation. So we are inclined to conjecture that both $\psi_{\mathrm{CONVEX}}$ and $\psi_{\mathrm{CONNECTED}}$ require about $2 \log_2 |R|$ tape squares for realization by Turing machines.
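The register trick mentioned above (add $b - a$ at each step and let the remainder overflow into x) can be illustrated as follows. This is a sketch under assumptions that the text does not spell out: a and b are taken to be integer coordinates with $b \ge a$, n is the number of steps in y, and `walk_line` is a hypothetical name. It is the familiar incremental way of following a line without division, not a reconstruction of the authors' exact procedure.

```python
def walk_line(a, b, n):
    """Return a list of (y, x, r) with (b - a) * y == x * n + r and 0 <= r < n.

    Only additions and comparisons are used: the remainder r is kept in a
    register of its own and any overflow past n is carried into x.
    """
    steps = []
    x, r = 0, 0
    for y in range(n + 1):
        steps.append((y, x, r))
        r += b - a
        while r >= n:      # overflow: carry one unit from r into x
            r -= n
            x += 1
    return steps
```

At every step x is the largest integer in $(b - a)\,y / n$ and r is the remainder, so only x, r, and the fixed quantities need to be remembered between steps.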
We regard our inability to prove a firm lower bound as another symptom of the general weakness of contemporary computation theory’s tools for establishing minimal computational complexity measures on particular algorithms.

9.4 Connectedness and Parallel Machinery

We have seen that there exists a Turing machine that can compute $\psi_{\mathrm{CONNECTED}}$ with very little auxiliary storage in the form of memory tape. The computation requires an appreciable amount of time, or number of steps of machine operation. The number of Turing-machine steps appears to be of the order of $|R| \log |R|$ for reasonably regular figures (for “bad” figures there may be a term of order $|R|^2 \log |R|$). On the other side, the Turing machine requires a remarkably small amount of physical machinery, which is used over and over again in the course of the computation. If one has more machinery, one should be able to reduce the number of time steps required for a computation, but we know very little about the nature of such exchanges.

In the case of realizing $\psi_{\mathrm{CONNECTED}}$, one can gain time by subdividing the space into regions and computing, simultaneously, properties of the connectivity within the regions. For example, suppose that we had machines capable of establishing, in less than the time necessary to compute $\psi_{\mathrm{CONNECTED}}$ for the whole retina, a “connection matrix” for boundary points on each quadrant. In the figure, this means knowing that a is connected to a′, b to b′, and so on. The connectedness of the whole can then be decided by an algorithm that “sews” together these edges.

If the mesh is made finer, the computations within the cells become faster, but the “sewing” becomes more elaborate. On the other hand, the subdivision can probably be applied recursively to the sewing operation also, and we have not studied the possible exchanges.
We can find an interesting upper bound for an extreme case: Suppose the machine is composed entirely of Boolean functions of two arguments; then how much time is required for such a machine to compute $\psi_{\mathrm{CONNECTED}}$, assuming that each Boolean operation requires one time unit?

Suppose, for convenience, that R has $|R| = 2^n$ squares (points). Certain pairs of points are considered to be “adjacent,” and by chaining this relation we can describe $\psi_{\mathrm{CONNECTED}}$ by a particularly compact inductive definition; we write

$$C^1_{ij}(X) = \lceil x_i \wedge x_j \wedge (x_i \text{ is adjacent to } x_j) \rceil$$

and

$$C^{m+1}_{ij}(X) = \bigvee_{k=1}^{|R|} C^m_{ik}(X) \wedge C^m_{kj}(X). \qquad (1)$$

Each point $x_i$ is considered to be connected to itself, so that $C_{ii}(X) = \lceil x_i \in X \rceil$. Then it can be seen inductively that $C^m_{ij}(X)$ is true if and only if $x_i$ and $x_j$ are connected by a chain of $\le 2^m$ adjacent points, all in X. The whole figure is connected, that is, $\psi_{\mathrm{CONNECTED}}(X) = 1$, if $C^n_{ij}(X) = 1$ for every pair for which $x_i \in X$ and $x_j \in X$. Hence

$$\psi_{\mathrm{CONNECTED}} = \lceil x_i \wedge x_j \supset C^n_{ij}(X) \text{ for all } i, j \rceil = \bigwedge_{i=1}^{|R|} \bigwedge_{j=1}^{|R|} \left[ \bar{x}_i \vee \bar{x}_j \vee C^n_{ij}(X) \right]. \qquad (2)$$

This function can be composed in a machine with a separate layer for each level of $C^m_{ij}$. To connect $C^{m+1}_{ij}$ to the appropriate $C^m_{ij}$’s requires bringing together up to $|R|$ terms, using Equation 1, and this requires a tree of OR’s of at most $n = \log_2 |R|$ layers in depth. There are n such layers so the total time to compute $C^n_{ij}$ is of the order of $n^2$. Using Relation 2, we find the final combination requires about 2n layers so we have

$$\mathrm{time}(\psi_{\mathrm{CONNECTED}}) \le (\log |R|)^2 + k \cdot \log |R|,$$

where k is a small constant.*

We doubt that the computation can be done in much less than the order of $(\log |R|)^2$ units of time, with any arrangement of plausible computer ingredients arranged in any manner whatever. Notice that we were careful to count the delay entailed by the OR operations.
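The inductive definition in Equation 1 and the final test in Relation 2 can be written out directly as repeated Boolean “squaring” of a connection matrix. The sketch below is illustrative only and rests on assumptions not fixed by the text: points are (row, column) pairs, “adjacent” is taken to mean the four edge-neighbors, and the function names are hypothetical. It runs serially, so it shows what each layer computes, not the parallel timing.

```python
from itertools import product


def adjacent(p, q):
    """Assumed adjacency: cells that share an edge."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1


def connection_matrix(points, figure):
    """Build the level-1 matrix, then apply Equation 1 about log2 |R| times."""
    size = len(points)
    inside = [p in figure for p in points]
    # Level 1: both points in X, and either identical or adjacent.
    c = [[inside[i] and inside[j] and (i == j or adjacent(points[i], points[j]))
          for j in range(size)] for i in range(size)]
    for _ in range((size - 1).bit_length()):      # ceil(log2(size)) levels
        # Equation 1: OR over k of C[i][k] AND C[k][j].
        c = [[any(c[i][k] and c[k][j] for k in range(size))
              for j in range(size)] for i in range(size)]
    return c


def is_connected(points, figure):
    """Relation 2: for every pair, (not x_i) or (not x_j) or C[i][j]."""
    c = connection_matrix(points, figure)
    size = len(points)
    return all((points[i] not in figure) or (points[j] not in figure) or c[i][j]
               for i, j in product(range(size), repeat=2))
```

Each pass of the loop doubles the chain length that the matrix can certify, which is the source of the n layers in the text. Note that, read literally, Relation 2 is also satisfied by the empty figure.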
If this is neglected the computation requires only $\log |R|$ steps, but this is physically unrealistic for large $|R|$. Indeed, we really should prohibit the unlimited “branching” or copying of the outputs of the elements; if the amplifiers that are physically necessary for this are counted we have to replace our estimate by $3(\log |R|)^2$. As usual, we have no firm method to establish the lower bound. However, the following pseudoproof seems relevant:

1. Using more “memory” in the machine doesn’t seem to help. Can the machine be speeded up by storing a library of connected figures and identifying them rather than working out the definition of connectivity each time? The extreme: build a library of all connected figures on R. A tree of binary Boolean operators can be built to match any pattern in just $\log |R|$ time steps. This greatly speeds up the analogue of part 1 above. But there are so many different connected figures that one has now to OR together of the order of $2^{\theta |R|}$ terms (where $\theta$ is some fraction less than 1) so the analogue of part 2 takes $\log(2^{\theta |R|}) = \theta |R|$ steps, which is worse than $(\log |R|)^2$ for large R. Of course this is not a proof, but we think it is an indication.

2. Using loops cannot increase speed. The $(\log |R|)^2$ machine is a loop-free hierarchy of Boolean functions: it has no “serial” computation capacity except that which lies in its layered structure. One could vastly reduce its number of parts (of which there are the order of $|R|^3 \cdot \log |R|$) by making a network with loops: indeed we could build a Turing machine that would have only $k \log |R|$ parts, for some modest k. But, for a given computation of bounded length, the fastest machine with loops cannot be faster than the fastest loop-free machine (ignoring branching costs).

*This construction was suggested to us by R. Floyd and A. Meyer.
For one can always construct an equivalent loop-free machine by making copies of the original machine, one copy for each computation step, with all functions taking arguments from earlier copies.

3. The connection-matrix scheme seems hard to improve. There exist figures with nonintersecting paths of length the order of $|R|$. It seems clear that any recognition procedure using two-argument functions requires at least $\log |R|$ steps, because one cannot do better than double the path length at each step, as does our $C_{ij}$ connection-matrix method. At each such step there must be of the order of $|R|$ alternative connections that must be OR-ed together. Perhaps the proof could be completed if one could show that nothing can be gained by postponing these OR’s, so that each requires $\log |R|$ logic levels.

9.5 Connectedness in Iterative Arrays

Terry Beyer has investigated the time necessary to compute $\psi_{\mathrm{CONNECTED}}$ in a situation that provides a different and perhaps more natural model for parallel geometric procedures. (See W. T. Beyer, “Recognition of Topological Invariants by Iterative Arrays,” Ph.D. dissertation, M.I.T., October 1969.) Suppose that each square of a retina contains an automaton able to communicate only with its four neighbors. It can also tell the state (black or white) of its square. The final decision about whether the figure is connected or not is to be made by some fixed automaton, say the one in the top left-hand corner. On the assumption that the states change only at fixed intervals of time, we ask how many time units must pass before the decision can be made. It is obvious that on an $n \times n$ retina this will take at least 2n time units, for this is the time required for any information to pass from the bottom right corner to the top left. It is not difficult to design arrays of automata that will make the decision in the order of $n^2$ (that is, $|R|$) time units.
Beyer’s remarkable result is that $(2 + \epsilon)n$ is sufficient, where $\epsilon$ can be made as small as one likes by allowing the automaton to have sufficiently many states. Thus the order of magnitude of time taken by the array is proportional to $\sqrt{|R|}$, which is (naturally) intermediate between the times taken by the single serial machine ($|R|$) and the unrestricted parallel machine, which is known to take $\le (\log |R|)^2$.

The following gives an intuitive picture of Beyer’s (unpublished) algorithmic process. The overall effect is that of enclosing a component in a triangle and slowly compressing it into the northwest corner by moving the hypotenuse inward. Each component is compressed to one isolated point before vanishing. Whenever this event takes place it can be recognized locally and the information is transmitted through the network to the corner. Thus the connectedness decision is made positively or negatively depending on whether such an event happens once or more than once.

More precisely, the compression process starts by finding the set of all “southeast corners” of the figure. The center square is a SE corner if the South and East are empty. All other squares shown may be empty or full.

In the compression operation, each SE corner is removed, while inserting a new square when necessary to preserve connectedness, as shown in the next figure (X and T(X)). ... because this would break the connection.

The diagonal lines show how repetition of this local process does squeeze the figure to the northwest. Repeated applications of T eventually reduce each component to a single point. The next figure shows how it (narrowly but effectively) avoids merging two components. It is easy to see that a component within a hole will vanish (and be counted) just in time to allow the surrounding component to collapse down. We do not know any equivalent process for three dimensions. (Consider knots!)
III LEARNING THEORY

Introduction to Part III

Our final chapters explore several themes which have come, in cybernetic jargon, to be grouped under the heading “learning.” Up to now we discussed the timeless existence of linear representations. We now ask how to compute them, how long this takes, how big they are, and how efficient they are as a means of storing information.

A proof, in Chapter 10, that coefficients can grow much faster than exponentially with $|R|$ has serious consequences both practically and conceptually: the use of more memory capacity to store the coefficients than to list all the figures strains the idea that the machine is making some kind of abstraction.

Chapter 11 clarifies the remarkable perceptron convergence theorem by relating it to familiar phenomena associated with finite-state machines, with optimization theory and with feedback as a computational device.

Chapter 12 abandons the strict definition of the perceptron to study a larger family of algorithms based on local partial predicates. These include methods (like Bayesian decisions) used by statisticians, as well as ideas (like hash coding) known only to programmers. Its aim is to indicate an area of computer science encompassing these apparently different processes. We dramatize the need for such a theory by singling out a simply stated unsolved problem about the more direct and commonly advocated methods for the storage and retrieval of information.

10 Magnitude of the Coefficients

10.1 Coefficients of the Parity Predicate

In §3.1 we discussed the predicate

$$\psi_{\mathrm{PARITY}}(X) = \lceil |X| \text{ is an odd number} \rceil$$

and showed that if $\Phi$ is the set of masks then all the masks must appear in any $L(\Phi)$ expression for $\psi_{\mathrm{PARITY}}$. One such expression is

$$\psi_{\mathrm{PARITY}} = \Big\lceil \sum_i (-2)^{|S(\varphi_i)|} \varphi_i(X) < -1 \Big\rceil,$$

which contains all masks of $\Phi$ with coefficients that grow exponentially with the support-size of the masks.
We will now show that the coefficients must necessarily grow at this rate, because the sign-changing character of parity requires that each coefficient be large enough to drown out the effects of the many coefficients of its submasks. In effect, we show that $\psi_{\mathrm{PARITY}}$ can be realized over the masks only by a stratificationlike technique!

So suppose that we have $\psi_{\mathrm{PARITY}} = \lceil \sum \alpha_i \varphi_i > 0 \rceil$. Suppose also that the group-invariance theorem has been applied to make equal all $\alpha$’s for all $\varphi$’s of the same support size, and suppose finally that the discrimination of $\psi_{\mathrm{PARITY}}$ is “reliable,” for example, that $\sum \alpha_i \varphi_i \ge 1$ for odd parity and $\le 0$ for even parity. Then we obtain the inequalities

$$\begin{aligned}
\alpha_1 &\ge 1, \\
\alpha_2 + 2\alpha_1 &\le 0, \\
\alpha_3 + 3\alpha_2 + 3\alpha_1 &\ge 1, \\
\alpha_4 + 4\alpha_3 + 6\alpha_2 + 4\alpha_1 &\le 0,
\end{aligned}$$

by applying the linear form to figures with 1, 2, 3, ... points. The general formula is then obtained by noticing the familiar binomial coefficients, and proving by induction that

$$\sum_{i=1}^{n} \binom{n}{i} \alpha_i \;\begin{cases} \ge 1 & \text{if } n \text{ is odd}, \\ \le 0 & \text{if } n \text{ is even}. \end{cases}$$

Next, by subtracting successive inequalities, we define

$$D_n = \sum_{i=1}^{n+1} \binom{n+1}{i} \alpha_i - \sum_{i=1}^{n} \binom{n}{i} \alpha_i = \alpha_{n+1} + \sum_{i=1}^{n} \binom{n}{i-1} \alpha_i,$$

so that for all n = 0, 1, 2, ...

$$(-1)^n D_n \ge 1.$$

Using these inequalities, we will obtain a bound on the coefficients $\{\alpha_i\}$. We will sum the inequalities with certain positive weights; choose any $M > 0$, and consider the sum

$$\sum_{i=0}^{M} \binom{M}{i} (-1)^i D_i \ge \sum_{i=0}^{M} \binom{M}{i} = 2^M.$$

The left-hand side is

$$\begin{aligned}
\sum_{i=0}^{M} (-1)^i \binom{M}{i} \sum_{k=0}^{i} \binom{i}{k} \alpha_{k+1}
&= \sum_{k=0}^{M} \alpha_{k+1} \sum_{i=k}^{M} (-1)^i \, \frac{M!}{i!\,(M-i)!} \cdot \frac{i!}{k!\,(i-k)!} \\
&= \sum_{k=0}^{M} \alpha_{k+1} \binom{M}{k} \sum_{i=k}^{M} (-1)^i \, \frac{(M-k)!}{(i-k)!\,(M-i)!} \\
&= \sum_{k=0}^{M} \alpha_{k+1} \binom{M}{k} (-1)^k (1 - 1)^{M-k} \\
&= \alpha_{M+1} (-1)^M,
\end{aligned}$$

so we obtain

$$|\alpha_{M+1}| \ge 2^M.$$

Theorem 10.1: In any “reliable” realization of $\psi_{\mathrm{PARITY}}$ as a linear threshold function over the set of masks, the coefficients grow at least as fast as $2^{|S(\varphi)| - 1}$.

These values hold for the average, so if the coefficients of each type are not equal, some must be even larger!
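A small numerical check may make the bound of Theorem 10.1 concrete. The sketch below assumes one particular symmetric choice of coefficients, $\alpha_k = (-1)^{k+1} 2^{k-1}$ for every mask of support size k, which is a rescaling of the $(-2)^{|S(\varphi)|}$ expression of §10.1; the function names are hypothetical. It confirms that this choice is “reliable” and that the weighted sum of the $D_i$ equals $2^M$, so the bound is met with equality.

```python
from math import comb


def alpha(k):
    """Assumed coefficient shared by every mask of support size k (k >= 1)."""
    return (-1) ** (k + 1) * 2 ** (k - 1)


def linear_form(n):
    """Value of the linear form on a figure with n points.

    Exactly comb(n, k) masks of support size k are satisfied by such a figure.
    """
    return sum(comb(n, k) * alpha(k) for k in range(1, n + 1))


def difference(i):
    """D_i = alpha_{i+1} + sum over k of comb(i, k - 1) * alpha_k."""
    return sum(comb(i, k) * alpha(k + 1) for k in range(i + 1))


def weighted_sum(m):
    """Sum over i of comb(m, i) * (-1)^i * D_i."""
    return sum(comb(m, i) * (-1) ** i * difference(i) for i in range(m + 1))
```

For these coefficients `linear_form(n)` is 1 for odd n and 0 for even n, and `weighted_sum(m)` equals `(-1) ** m * alpha(m + 1)`, in line with the identity derived in the text.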
This shows that it is impractical to use masklike $\varphi$’s to recognize paritylike functions: even if one could afford the huge number of $\varphi$’s, one would have also to cope with huge ranges of their coefficients!

Remark: This has a practically fatal effect on the corresponding learning machines. At least $2^{|R|}$ instances of just the maximal pattern are required to “learn” the largest coefficient; actually the situation is far worse because of the unfavorable interactions with lower-order coefficients (see §11.4).

It follows, moreover, that the information capacity necessary to store the set $\{\alpha_i\}$ of coefficients is as much as would be needed to store the entire set of patterns recognized by $\psi_{\mathrm{PARITY}}$, that is, the odd subsets of R. For, any uniform representation of the $\alpha_i$’s must allow $|R| - 1$ bits for each, and since there are $2^{|R|}$ coefficients the total number of bits required is $|R| \cdot 2^{|R| - 1}$. On the other hand there are $2^{|R| - 1}$ odd subsets of R, each representable by an $|R|$-bit sequence, so that $|R| \cdot 2^{|R| - 1}$ bits would also suffice to represent the subsets. And the coefficients in §10.2 would require much more storage space.

It should also be noted that $\psi_{\mathrm{PARITY}}$ is not very exceptional in this regard because the positive normal-form theorem tells us that all possible $2^{2^{|R|}}$ Boolean functions are linear threshold functions on the set of masks. So, on the average, specification of a function will require $2^{|R|}$ bits of coefficient information, and nonuniformity of coefficient sizes would be expected to raise this by a substantial factor.

10.2 Coefficients Can Grow Even Faster than Exponentially in $|R|$

It might be suspected that $\psi_{\mathrm{PARITY}}$ is a worst case both because (1) parity is a worst function and (2) masks make a worst $\Phi$. In fact the masks make rather a good base because coefficients over masks never have to be larger than $|\alpha_i| = 2^{|S(\varphi_i)|}$, as can be seen by expanding an arbitrary predicate into positive normal form.
We now present a new predicate $\psi_{\mathrm{EQ}}$, together with a rather horrible $\Phi$, that leads to worse coefficients. Let R be a set of points $y_1, \ldots, y_n, z_1, \ldots, z_n$ and let $\{Y_i\}$ and $\{Z_i\}$ each be enumerations of the $2^n$ subsets of the y’s and z’s, respectively. Then any figure $X \subset R$ has a unique decomposition $X = Y_j \cup Z_k$. We will consider the simple predicate

$$\psi_{\mathrm{EQ}}(Y_j \cup Z_k) = \lceil j = k \rceil,$$

which simply tests, for any figure X, whether its Y and Z parts have the same positions in the enumeration. The straightforward geometric example is that in which the two halves of R have the same form, and $Y_i$ and $Z_i$ are corresponding sets of y and z points.

We will construct a very special set $\Phi$ of predicates for which $\psi_{\mathrm{EQ}} \in L(\Phi)$ and show that any realization of $\psi_{\mathrm{EQ}}$ in $L(\Phi)$ must involve incredibly large coefficients! We want to point out at the start that the $\Phi$ we will use was designed for exactly this purpose. In the case of $\psi_{\mathrm{PARITY}}$ we saw that coefficients can grow exponentially with the size of $|R|$; in that case the $\Phi$ was the set of masks, a natural set, whose interest exists independently of this problem. To show that there are even worse situations we construct a $\Phi$ with no other interest than that it gives bad coefficients.

We will define $\Phi$ to contain two types of predicates:

$$\psi_i(Y_j \cup Z_k) = \lceil i = k \rceil,$$

$$\chi_i(Y_j \cup Z_k) = \lceil (j = k \wedge i = k) \vee (j = k - 1 \wedge i < k) \rceil,$$

each defined for $i = 1, \ldots, 2^n$. Note that $|S(\psi_i)| = n$ and $|S(\chi_i)| = 2n$.

First we must show that $\psi_{\mathrm{EQ}} \in L(\Phi)$. But consider the formula

$$\psi_{\mathrm{EQ}} = \Big\lceil \sum_i 2^i (\psi_i - \chi_i) < 1 \Big\rceil.$$

Case I: $j = k$. Then $\psi_k = 1$ and $\chi_k = 1$, hence $\psi_{\mathrm{EQ}} = \lceil 2^k (1 - 1) < 1 \rceil$ is true.

Case II: $j \ne k$ and $j \ne k - 1$. Then only $\psi_k = 1$ and $\psi_{\mathrm{EQ}} = \lceil 2^k < 1 \rceil$ is false.

Case III: $j = k - 1$. Then $\psi_k = 1$ and $\chi_i = 1$ for $i = 1, \ldots, k - 1$. So

$$\Big\lceil 2^k - \sum_{i=1}^{k-1} 2^i < 1 \Big\rceil = \lceil 2 < 1 \rceil \text{ is FALSE},$$

and the predicate holds only for the $j = k$ case, as it should. So $\psi_{\mathrm{EQ}}$ is indeed in $L(\Phi)$.
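The three cases above can also be checked exhaustively for a small n. In the sketch below a figure $Y_j \cup Z_k$ is represented only by its pair of indices (j, k), which is all that the predicates $\psi_i$ and $\chi_i$ depend on; the function names are hypothetical.

```python
def psi(i, j, k):
    """psi_i(Y_j U Z_k) = [i = k]."""
    return int(i == k)


def chi(i, j, k):
    """chi_i(Y_j U Z_k) = [(j = k and i = k) or (j = k - 1 and i < k)]."""
    return int((j == k and i == k) or (j == k - 1 and i < k))


def linear_sum(j, k, count):
    """The sum over i = 1..count of 2^i * (psi_i - chi_i)."""
    return sum(2 ** i * (psi(i, j, k) - chi(i, j, k)) for i in range(1, count + 1))


def psi_eq(j, k, count):
    """The threshold predicate [sum < 1]."""
    return linear_sum(j, k, count) < 1


def realizes_equality(n):
    """Check psi_eq against [j = k] for every pair of subset indices."""
    count = 2 ** n
    return all(psi_eq(j, k, count) == (j == k)
               for j in range(1, count + 1) for k in range(1, count + 1))
```

The sum is 0 when $j = k$, exactly 2 when $j = k - 1$, and $2^k$ otherwise, so the threshold at 1 separates the cases as claimed.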
Now we establish bounds on the coefficients. Consider any expression

$$\psi_{\mathrm{EQ}} = \Big\lceil \sum \alpha_i \chi_i + \sum \beta_i \psi_i > \theta \Big\rceil.$$

Then

for sets $Y_{k+1} \cup Z_k$ we get $\beta_k \le \theta$,

for sets $Y_k \cup Z_k$ we get $\alpha_k + \beta_k \ge \theta + 1$ (strong separation),

for sets $Y_{k-1} \cup Z_k$ we get $\alpha_1 + \cdots + \alpha_{k-1} + \beta_k \le \theta$.

We can set $\theta = 0$ by subtracting it from every $\beta$, since just one $\beta$ appears in each inequality. So $\beta_1 \le 0$ and $\alpha_1 \ge 1$. And since

$$\alpha_k \ge 1 + \alpha_1 + \cdots + \alpha_{k-1}$$

we have immediately $\alpha_2 \ge 2$, $\alpha_3 \ge 4$, ..., $\alpha_j \ge 2^{j-1}$. Because the index j runs from 1 to $2^n$, the highest $\alpha$ must be at least $2^{2^n - 1}$ times as large as the initial separation term $(\alpha_1 + \beta_1) - \beta_1 = \alpha_1$.

This incredible growth rate is based in part on a mathematical joke: we note that an expression “$j = k$” equivalent to that for $\psi_{\mathrm{EQ}}$ appears already within the definitions of the $\chi_i$’s, and it is there precisely to not-quite-fatally weaken their usefulness in $L(\Phi)$. Ironically, if we write $\psi_{\mathrm{EQ}}$ in terms of masks we have

$$\psi_{\mathrm{EQ}} = \Big\lceil \sum_i (y_i + z_i - 2 y_i z_i) < 1 \Big\rceil,$$

and the coefficients are very small indeed!

Problems: Find a $\Phi$ that makes the coefficients of $\psi_{\mathrm{PARITY}}$ grow like $2^{2^{|R|}}$. Solution in §10.3.

In §10.1, $\Phi$ has $2^{|R|}$ elements and $\psi_{\mathrm{PARITY}}$ requires coefficients like $2^{|R|}$. In §10.2, $\Phi$ has $2 \cdot 2^{|R|/2}$ elements, but the coefficients are like $2^{2^{|R|/2}}$. It is possible to make $\Phi$’s with up to $2^{2^{|R|}}$ elements. Does this mean there are $\psi$’s and $\Phi$’s with coefficients like $2^{2^{2^{|R|}}}$? (We think not. See §10.3.)

Can it be shown that one never needs coefficient ratios larger than $2^{|\Phi|}$ for any $\Phi$?

Can we make more precise the relations between coefficient sizes and ratios?

Can it be shown that the bounds obtained by assuming integer coefficients give bounds on the precisions required of arbitrary real coefficients?

Can you establish linear bounds for coefficients for the predicates in Chapter 7?
The linear threshold predicate

$$\psi_{\mathrm{EQ}} = \Big\lceil \sum_i 2^i (\psi_i - \chi_i) < 1 \Big\rceil$$

is very much like those obtained by the stratification-theorem method, in that at each level i the coefficient is chosen to dominate the worst case of summation of the coefficients of preceding levels. The result of the theorems of §10.1 and §10.2 is that for those predicates there do not exist any linear forms with smaller coefficients, and this suggests to us that (with respect to given $\Phi$’s) perhaps there is a sense in which some predicates are inherently stratified. We don’t have any strong ideas about this, except to point out that there is a serious shortage of computer-oriented concepts for classification of patterns. We do not know, for most of the cases in Chapter 7, which of them really require the stratificationlike coefficient growth: that is to say, we don’t have any general method to detect “inherent stratification.”

10.3 Predicate With Possibly Maximal Coefficients

Define $\|X\|$ to be the index of X in an ordering of all the subsets of R. We will consider the simple predicate

$$\psi_{\|\mathrm{PARITY}\|} = \lceil \|X\| \text{ is odd} \rceil$$

with respect to the following set $\Phi$ of predicates:

$$\varphi_i(X) = \begin{cases} 0 & \text{if } \|X\| < i, \\ 1 & \text{if } \|X\| = i, \\ (\|X\| - i) \bmod 2 & \text{if } \|X\| > i. \end{cases}$$

Then $\psi_{\|\mathrm{PARITY}\|}$ is in $L(\Phi)$ and is in fact realized by

$$\psi_{\|\mathrm{PARITY}\|} = \Big\lceil \sum_i (-1)^i f_i \varphi_i < 0 \Big\rceil,$$

where $f_i$ is the ith Fibonacci number ($f_n = f_{n-1} + f_{n-2}$):

$$\{f_i\} = \{1, 1, 2, 3, 5, 8, 13, \ldots\}.$$

Theorem 10.3: Any form in $L(\Phi)$ for $\psi_{\|\mathrm{PARITY}\|}$ must have coefficients at least this large; since the $f_i$ grow approximately as

$$\frac{1}{\sqrt{5}} \left( \frac{\sqrt{5} + 1}{2} \right)^{i},$$

the largest coefficient is then of the order of magnitude of

$$\frac{1}{\sqrt{5}} \, 2^{\alpha \cdot 2^{|R|}}, \quad \text{where } \alpha = \log_2 \left( \frac{\sqrt{5} + 1}{2} \right).$$

The proof of the theorem can be inferred by studying the array below, whose rows give the coefficient $\alpha_i$ and the values of $\varphi_i$ for $\|X\| = 1, 2, 3, \ldots$:

$$\begin{array}{rr|ccccccccc}
i & \alpha_i & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & \cdots \\
\hline
1 & -1 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & \cdots \\
2 & +1 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & \cdots \\
3 & -2 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & \cdots \\
4 & +3 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & \cdots \\
5 & -5 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & \cdots \\
6 & +8 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & \cdots \\
7 & -13 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & \cdots
\end{array}$$
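The Fibonacci realization can be checked mechanically from the definition of $\varphi_i$. In this sketch a figure X is represented only by its index $\|X\|$, and the function names are hypothetical; the code verifies the realization itself, not the lower bound of Theorem 10.3.

```python
def phi(i, index):
    """phi_i(X) as a function of index = ||X||."""
    if index < i:
        return 0
    if index == i:
        return 1
    return (index - i) % 2


def fibonacci(count):
    """Return [f_1, ..., f_count] with f_1 = f_2 = 1."""
    values = [1, 1]
    while len(values) < count:
        values.append(values[-1] + values[-2])
    return values[:count]


def form_value(index, count):
    """The sum over i = 1..count of (-1)^i * f_i * phi_i(X)."""
    f = fibonacci(count)
    return sum((-1) ** i * f[i - 1] * phi(i, index) for i in range(1, count + 1))


def norm_parity(index, count):
    """The threshold predicate [sum < 0]."""
    return form_value(index, count) < 0
```

The sum comes out as -1 when $\|X\|$ is odd and 0 when it is even, which follows from the identities $f_2 + f_4 + \cdots + f_{2k} = f_{2k+1} - 1$ and $f_1 + f_3 + \cdots + f_{2k-1} = f_{2k}$.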
It can be seen that if $\alpha_1 < 0$ and the coefficients are integers then

$$\alpha_{2i+1} < - \sum_{j=1}^{i} \alpha_{2j}, \qquad \alpha_{2i} \ge - \sum_{j=1}^{i} \alpha_{2j-1},$$

and the reader can verify that this implies that for all $\alpha_i$, $|\alpha_{i+1}| \ge |\alpha_i| + |\alpha_{i-1}|$; hence $|\alpha_i| \ge f_i$.

Discussion and conjecture: This predicate and its $\Phi$ have the same quality as that in §10.2: the $\varphi$’s themselves are each almost the desired predicate. Note also that by properly ordering the subsets, we can arrange that $\psi_{\|\mathrm{PARITY}\|} = \psi_{\mathrm{PARITY}}$. We conjecture that this example is a worst case: to be precise, if $\Phi$ contains $|\Phi|$ elements, the maximal coefficient growth cannot be faster than

$$\left( \frac{1 + \sqrt{5}}{2} \right)^{|\Phi|},$$

where the exponent constant is the Fibonacci, or golden-rectangle, ratio. Our conjecture is based only on arguments too flimsy to write down.*

*Such as the fact that $\sqrt{5}$ occurs in upper bounds in the theories of rational approximations and geometry of numbers.

10.4 The Group-Invariance Theorem and Bounded Coefficients on the Infinite Plane

In §7.10 we noted a counterexample to extending the group-invariance theorem (§2.3) to infinite retinas. The difficulty came through using an infinite stratification that leads to unbounded coefficients. This in turn raises convergence problems for the symmetric summations used to prove equal the coefficients within an equivalence class. If the coefficients are bounded, and the group contains the translation group, we can prove the corresponding theorem. (We do not know stronger results: presumably there is a better theorem with a summability-type condition on the coefficients and a structure condition on the group.) The proof uses the geometric fact that for increasing concentric circles about two fixed centers the proportion of area in common approaches unity.

10.4.1 Bounded Coefficients and Group Invariance

Let $\psi$ be a predicate invariant under translation of the infinite plane.
Theorem 10.4.1: If the coefficients of the $\varphi$’s are bounded in each equivalence class, then there exists an equivalent perceptron with coefficients equal in each equivalence class.

Proof: Let $T_C$ be the set of translations with displacements less than some distance C. Let $\psi = \lceil \sum \alpha(\varphi) \varphi > \theta \rceil$. Now define

$$\begin{aligned}
\psi_C(X) &= \Big\lceil \sum_{g \in T_C} \sum_{\varphi} \alpha(\varphi)\, \varphi(gX) > \sum_{g \in T_C} \theta \Big\rceil \\
&= \Big\lceil \sum_{\varphi} \varphi(X) \sum_{g \in T_C} \alpha(\varphi g^{-1}) > \sum_{g \in T_C} \theta \Big\rceil \\
&= \Big\lceil \sum_{\varphi} \varphi(X) \sum_{g \in T_C} \alpha(\varphi g) > \sum_{g \in T_C} \theta \Big\rceil,
\end{aligned}$$

because $T_C$ is carried onto itself under the group inverse. By the argument of §2.3 each $\psi_C$ is equivalent to $\psi$ as a predicate. The following lemma shows that we can select an increasing sequence $R_1, R_2, \ldots$ of radii for which the limiting average of $\alpha(\varphi g)$ over $g \in T_{R_i}$ has the same value independent of $\varphi$ within every equivalence class.

Lemma: Suppose some function f(x) is bounded, that is, $|f(x)| < M$, in $E^2$. Then there exists a sequence of increasing radii $R_i$ such that for any system of concentric circles with these radii, the value of

$$\lim_{i \to \infty} \frac{1}{\pi R_i^2} \iint_{|y - p| < R_i} f(y)\, dA$$

will be the same, independent of the selection of the common center p.

Proof: Let p be the origin. Since all values of these averages lie in the interval $[-M, M]$, any infinite sequence of them must have an infinite convergent subsequence. Choose such a convergent subsequence from the circles with radii 1, 2, 3, etc., and let $R_i$ be such a sequence. Then for each i we have

$$\left| \frac{1}{\pi R_i^2} \iint_{|y| < R_i} f(y)\, dA \right| < M.$$

Given any other center p for the circles, note that

$$\left| \iint_{|y| < R_i} f(y)\, dA - \iint_{|y| < R_i} f(p + y)\, dA \right| < 2 \cdot M \cdot A_i(p),$$

where $A_i(p)$ is the area of nonoverlap between the two disks $|y| < R_i$ and $|y - p| < R_i$. But as the radius grows, for any p

$$\lim_{i \to \infty} \frac{A_i(p)}{\pi R_i^2} = 0,$$

so the two sequences approach the same limit. Q.E.D.
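The geometric fact used in the lemma, that the nonoverlapping part of two equal disks becomes a vanishing fraction of the disk area as the radius grows, can be seen numerically. The sketch below uses the standard formula for the lens-shaped intersection of two circles of equal radius; it takes "area of nonoverlap" to mean the area lying in exactly one of the two disks, which is an assumption about the intended reading, and `nonoverlap_fraction` is a hypothetical name.

```python
import math


def nonoverlap_fraction(radius, shift):
    """Area in exactly one of two radius-R disks, divided by pi * R^2.

    The centers of the disks are a distance `shift` apart.
    """
    disk = math.pi * radius ** 2
    if shift >= 2 * radius:
        return 2.0                      # disjoint disks share no area at all
    half = shift / 2
    # Area of the lens common to both disks.
    lens = (2 * radius ** 2 * math.acos(half / radius)
            - shift * math.sqrt(radius ** 2 - half ** 2))
    return 2 * (disk - lens) / disk


def fractions_for(shift, radii):
    """Evaluate the fraction for a fixed displacement and growing radii."""
    return [nonoverlap_fraction(r, shift) for r in radii]
```

For a fixed displacement the fraction falls off roughly in proportion to shift / radius, so it tends to zero along any sequence of radii that grows without bound.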
To prove the main theorem, we simply choose a representative $\varphi$ from some equivalence class, and set $f(g) = \alpha(\varphi g)$, regarding g as a translation from the origin.

It follows that the perceptron obtained in §7.4 must have unbounded coefficients, and there is no equivalent representation in $L(\Phi)$ with bounded coefficients.

Note: The methods of §10.2 and §10.3 are similar to those used by Myhill and Kautz [1961] to find maximal coefficients for the order-1 case. They show that with integer coefficients there is an order-1 predicate for which some coefficient must exceed a bound of the order of $2^n / n$.

11 Learning

11.0 Introduction

In previous chapters we used no systematic technique to find a representation of a predicate as a member of an $L(\Phi)$. Instead, we always constructed coefficients by specific mathematical analysis of the predicate and the set $\Phi$. These analyses were ad hoc to each predicate. In this chapter we study situations in which sets of coefficients can be found by a more systematic and easily mechanized procedure. It is the possibility of doing this that has given the perceptron its reputation as a “learning machine.”

The conceptual scheme for “learning” in this context is a machine with an input channel for figures, a pair of YES and NO output indicators, and a reinforcement or “reward” button that the machine’s operator can use to indicate his approval or disapproval of the machine’s behavior (see Figure 11.1). The operator has two stacks $F^+$ and $F^-$ of figures and he would like the machine to respond YES to all figures in $F^+$ and NO to all figures in $F^-$.
He indicates approval by, say, pressing the button if the reaction is correct. The machine is to modify its internal state better to conform to its master’s wishes.

There are many ways to build such a machine. The most obvious scheme is to have some kind of recording device to store incoming figures in two separate files, for $F^+$ and $F^-$. This kind of machine will never make a mistake on a previously seen figure but, along with its never-forgetting, it brings other elephantine characteristics. Another, very different kind of machine would attempt to find descriptive characteristics that distinguish between the figures of the two classes, and to use new figures to sharpen and elaborate these descriptions. This kind of machine would, in the long run, require less memory but its mechanism and its theory are both much more complicated. If the classes $F^+$ and $F^-$ are very large then the first machine is disbarred; if there is no description within the practical repertoire of the second machine, it will fail.

The perceptron, as a pattern-discriminating machine, lies between these two paradigms. It is not a pure memory-matching machine, for it does not store the pictures. As a description-machine its repertoire is limited (as we have seen in the previous chapters) to what can be done with “local” features of the patterns and only linear threshold relations between these features. The existence of the simple learning procedures described below results from this restriction on the machine’s descriptive power (and could be regarded as a partial compensation for this limitation).

Let us suppose that the machine contains a perceptron with a fixed $\Phi$ and adjustable coefficients. When a figure $X$ is presented the sum $\sum_{\varphi} \alpha_\varphi \varphi(X)$ is computed. If $X$ belongs to $F^+$ and this sum is positive, the machine responds yes and all is well. If $X$ belongs to $F^+$ but the sum is negative, the machine responds no. This is bad, and something must be done.
What is the simplest possible correction procedure? The first idea that comes to mind, especially to people who have grown up on the idea of feedback, is the following: Since the sum was too small, let’s increase its coefficients. If it had come out too large (namely, response yes for a figure in $F^-$), we would decrease coefficients. But we must adjust the coefficients in a reasonable manner, so that the feedback effect is directed properly. Suppose that $\sum_{\varphi} \alpha_\varphi \varphi(X)$ comes out negative for an $X$ in $F^+$. In general some $\varphi$'s give zero values for $\varphi(X)$, and their coefficients clearly cannot be blamed for the bad total. In fact, changing these coefficients might do harm in relation to other $X$'s and does no good in relation to the current $X$. Thus we should increase $\alpha_\varphi$ only if $\varphi(X) = 1$.

We should like a procedure for doing this whose mathematical form is clear enough to allow simple analysis and whose power is great enough to yield a reasonable success. The procedure given in §11.1 achieves both these goals, but we will first make a few introductory remarks.

11.0.1 Coefficients and Vectors

It is convenient to think of the set of coefficients $\{\alpha_\varphi\}$, ordered in an arbitrary but fixed way, as a vector in $|\Phi|$-dimensional space. Denote this vector by $A$. Similarly the set $\{\varphi(X)\}$, ordered in the same way, can be taken as a vector whose components are the values of the $\varphi(X)$'s. We denote this vector by $\Phi(X)$. Now the operation of increasing those coefficients that correspond to nonzero values of $\varphi(X)$ is neatly performed by merely adding the vector $\Phi(X)$ to the vector $A$. If the sum had come out positive for $X$ in $F^-$, we would subtract $\Phi(X)$ from $A$.

A priori, any procedure of this sort runs the risk of oscillating wildly. An adjustment of the coefficients in the appropriate direction for one figure might undo the previous adjustment for another figure.
Thus our intuition about whether it will work is influenced by two conflicting ideas drawn from experience with cybernetic situations: simple error-correcting feedback does often work; on the other hand, the process involves a search in a $|\Phi|$-dimensional space and our experience with other schemes for “hill-climbing” makes us acutely aware of the dangers that beset such procedures. Close analysis is needed.

This question of whether simple feedback will work can be posed in other words closely related to our main theme. The condition to be satisfied by the set of coefficients $\{\alpha_\varphi\}$ is defined globally in relation to the entire set of figures. On the other hand the “correction” procedure is highly local in the sense that each change made to the current values of these coefficients is based on consideration of just one figure. Thus the problem of finding conditions under which the procedure will make the $\alpha_\varphi$ converge to globally satisfactory values belongs to the study of the relation between apparently global and apparently local computations.

In this chapter we will show that very small refinements will turn the simple feedback principle into a workable “training” or error-correction procedure. The main theorems about this are already fairly well known. Our main concern will be to understand why it works. By analyzing it from several points of view, its mechanism will become transparent and its logic obvious.

In our discussion of recognizability of figures we have tried to replace vague formulations of questions about whether perceptrons are “good” or “bad” recognizers by an analytic theory that shows why perceptrons succeed in some cases and must fail in others.
Although we do not have an equally elaborated theory of “learning,” we can at least demonstrate that in cases where “learning” or “adaptation” or “self-organization” does occur, its occurrence can be thoroughly elucidated and carries no suggestion of mysterious little-understood principles of complex systems. Whether there are such principles we cannot know. But the perceptron provides no evidence; and our success in analyzing it adds another piece of circumstantial evidence for the thesis that cybernetic processes that work can be understood, and those that cannot be understood are suspect.

11.1 The Perceptron Convergence Theorem

Consider the following program in which the vector notation $A \cdot \Phi$ is used in place of our usual “$\sum$” notation.

  start:     Choose any value for $A$.
  test:      Choose an $X$ from $F^+ \cup F^-$.
             If $X \in F^+$ and $A \cdot \Phi(X) > 0$ go to test.
             If $X \in F^+$ and $A \cdot \Phi(X) \le 0$ go to add.
             If $X \in F^-$ and $A \cdot \Phi(X) < 0$ go to test.
             If $X \in F^-$ and $A \cdot \Phi(X) \ge 0$ go to subtract.
  add:       Replace $A$ by $A + \Phi(X)$. Go to test.
  subtract:  Replace $A$ by $A - \Phi(X)$. Go to test.

We assume until further notice that there exists a vector, $A^*$, with the property that if $X \in F^+$ then $A^* \cdot \Phi(X) > 0$ and if $X \in F^-$ then $A^* \cdot \Phi(X) < 0$. The perceptron convergence theorem then states that whatever choice is made in start and whatever choice function is used in test, the vector $A$ will be changed only a finite number of times. In other words, $A$ will eventually assume a value $A^{\circ}$ for which $A^{\circ} \cdot \Phi(X)$ always has the proper sign, that is, the predicate $\psi = \lceil A^{\circ} \cdot \Phi > 0 \rceil$ will have the property:

  $X \in F^+$ implies $\psi(X) = 1$,
  $X \in F^-$ implies $\psi(X) = 0$.

This is often expressed by saying that the predicate $\psi(X)$ separates the sets $F^+$ and $F^-$. The convergence theorem can be loosely stated as: if the sets are separable (that is, if there exists a “solution” vector $A^*$), then the program will separate them (that is, it will find a solution vector $A^{\circ}$ which may or may not be the same as $A^*$).
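The start/test/add/subtract program can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' formulation: the names `dot` and `train_perceptron`, the cyclic order of presentation, the `max_passes` cutoff, and the small example sets are all hypothetical, and each figure is represented directly by its vector $\Phi(X)$.

```python
def dot(a, b):
    """Inner product A . Phi of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def train_perceptron(f_plus, f_minus, start=None, max_passes=1000):
    """Run the start/test/add/subtract program, presenting figures cyclically.

    f_plus and f_minus hold the vectors Phi(X) for the figures in F+ and F-.
    Returns (A, corrections) after one full pass that changes nothing.
    max_passes is a safety cutoff (an assumption, not part of the program).
    """
    samples = [(phi, True) for phi in f_plus] + [(phi, False) for phi in f_minus]
    a = list(start) if start is not None else [0] * len(samples[0][0])  # start
    corrections = 0
    for _ in range(max_passes):
        changed = False
        for phi, in_f_plus in samples:                       # test
            s = dot(a, phi)
            if in_f_plus and s <= 0:                         # add
                a = [ai + pi for ai, pi in zip(a, phi)]
            elif not in_f_plus and s >= 0:                   # subtract
                a = [ai - pi for ai, pi in zip(a, phi)]
            else:
                continue                                     # correct response
            changed = True
            corrections += 1
        if not changed:
            return a, corrections
    raise RuntimeError("no separating vector found within max_passes")


# Hypothetical example: Phi(X) = (1, x1, x2); F+ holds the inputs where x1 OR x2.
F_PLUS = [(1, 1, 0), (1, 0, 1), (1, 1, 1)]
F_MINUS = [(1, 0, 0)]
A, n_corrections = train_perceptron(F_PLUS, F_MINUS)
```

Because the loop ends only after a full pass without corrections, the returned vector gives the proper sign on every listed figure; if the two sets were not separable, the cutoff would be reached instead.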
Because we are now concerned more with the sets of coefficients $\{\alpha_\varphi\}$ than with the nature of $\Phi$ itself or the geometry of figures in $R$, it will be convenient to think of the functions in $L(\Phi)$ as associated with the sets $\{\alpha_\varphi\}$ regarded as vectors whose base vectors are the $\varphi$'s in $\Phi$. Warning: the vector-space base is the set of $\varphi$'s, and not the points of $R$! Although in this chapter we will think of the forms as elements of a vector space, one should remember that the set $L(\Phi)$ of $\psi$'s isn’t a vector space, and that each $\psi \in L(\Phi)$ can be represented by many $A$ vectors.†

In this vector-space context, the classes $F^+$ and $F^-$ of figures are mapped into classes of vectors, which we will still call $F^+$ and $F^-$. The mapping from pictures to vectors may, of course, be degenerate, for we could have two figures $X \ne X'$ for which $\Phi(X) = \Phi(X')$: the original figures are “seen” only through the $\varphi$'s, and some details can be lost.

We will now discard the restriction on the $\varphi$'s that their values be either 0 or 1. The $\varphi$-functions may now take on any real, positive or negative values and, for different $X$'s, each $\varphi$ may have any number of different values. So we can think of $F^+$ and $F^-$ as two arbitrary clouds of points in $\Phi$-space.

†It may be observed that vector geometry occurs only here and in Chapter 12 of this book. In the general perceptron literature, vector geometry is the chief mathematical tool, followed closely by statistics, which also plays a small role in our development. If we were to volunteer one chief reason why so little was learned about perceptrons in the decade that they have been studied, we would point toward the use of this vector geometry! For in thinking about the $\sum \alpha_i \varphi_i$'s as inner products, the relations between the patterns $\{X\}$ and the predicates in $L(\Phi)$ have become very obscure. The $A$-vectors are not linear operators on the patterns themselves; they are “co-operators,” that is, they operate on spaces of functional operators on the patterns. Since the bases (the $\varphi$-classes) of their vector spaces are arbitrary, one can’t hope to use them to discover much about the kinds of predicates that will lie in an $L(\Phi)$. The important questions aren’t about the linear properties of the $L(\Phi)$'s, but about the orders of complexities in computing pattern qualities from the information in the $\{\varphi(X)\}$ set itself.

The main danger in allowing this generality is that the feedback procedure might be overwhelmed by vectors too large or stalled by vectors too small, so instead of adding or subtracting $\Phi$ itself, we will later use instead the unit-length vector in the same direction:

$$\hat{\Phi} = \frac{\Phi}{|\Phi|}, \quad \text{so that } |\hat{\Phi}| = 1.$$

If the sets $F^+$ and $F^-$ are infinite the angles between pairs of vectors, one from each set, can have zero as a limit. In that case there is only one solution vector and the program may not find it. The conditions of Theorem 11.1 will exclude this possibility.

The case-analysis in test of the program just described is overcomplicated. The following program has the identical behavior:

  start:  Choose any value for $A$.
  test:   Choose a $\Phi$ from $F^+ \cup F^-$.
          If $\Phi \in F^+$ and $A \cdot \Phi > 0$ go to test.
          If $\Phi \in F^+$ and $A \cdot \Phi \le 0$ go to add.
          Replace $\Phi$ by $-\Phi$.
          If $\Phi \in F^-$ and $A \cdot \Phi > 0$ go to test.
          If $\Phi \in F^-$ and $A \cdot \Phi \le 0$ go to add.
  add:    Replace $A$ by $A + \Phi$. Go to test.

This is equivalent because (1) we have reversed the inequality signs in the part of test following the change of $\Phi$, so all decisions will go the same way; (2) the effect of “go to add” is the same as “go to subtract” with reversed sign of $\Phi$. Now, “replace $\Phi$ by $-\Phi$” is executed if and only if $\Phi \in F^-$, and since the inequality conditionals now have identical outcomes we can replace the program by the still equivalent program:

  start:  Choose any value for $A$.
  test:   Choose a $\Phi$ from $F^+ \cup F^-$; if $\Phi \in F^-$ change the sign of $\Phi$.
          If $A \cdot \Phi > 0$ go to test; otherwise go to add.
  add:    Replace $A$ by $A + \Phi$. Go to test.

In other words the problem of finding a vector $A$ to separate two given sets $F^+$ and $F^-$ is not really different from the problem of finding a vector $A$ that satisfies

$$\Phi \in F \implies A \cdot \Phi > 0$$

for a single given set $F$, defined as $F^+$ together with the negatives of the vectors of $F^-$. We use these observations to simplify the program and statement of the convergence theorem: for simplicity we will state a version that uses unit vectors.

Theorem 11.1: Perceptron Convergence Theorem: Let $F$ be a set of unit-length vectors. If there exists a unit vector $A^*$ and a number $\delta > 0$ such that $A^* \cdot \Phi > \delta$ for all $\Phi$ in $F$, then the program

  start:  Set $A$ to an arbitrary $\Phi$ of $F$.
  test:   Choose an arbitrary $\Phi$ of $F$, and if $A \cdot \Phi > 0$ go to test; otherwise go to add.
  add:    Replace $A$ by $A + \Phi$. Go to test.

will go to add only a finite number of times.

Some readers might be amused to note that the proof of this theorem does not use any assumptions of finiteness of the set $F$ or the dimension of the vector space. This will not be true of later sections where the compactness of the unit sphere plays an apparently essential role.

Corollary: We will generally assume that the program is presented a sequence such that each $\Phi \in F$ repeats indefinitely often. Then it follows that it will eventually find a “solution” vector $A$, that is, one for which $A \cdot \Phi > 0$ for all $\Phi \in F$.

This will not, of course, necessarily be $A^*$, because $A^*$ is an arbitrary solution vector. All solution vectors form a “convex cone,” and the program will stop changing $A$ as soon as it penetrates the boundary of this cone. [Convex cone: a set $S$ of vectors for which (1) $a \in S \implies ka \in S$ for all $k > 0$, (2) $a \in S$ and $b \in S \implies (a + b) \in S$. It is not a vector subspace because of the $k > 0$ condition.]

11.2 Proof of the Convergence Theorem

11.2.1

Define

$$G(A) = \frac{A^* \cdot A}{|A|}.$$

It may help some readers to notice that $G(A)$ is the cosine of the angle between $A$ and $A^*$. Because $|A^*| = 1$, we have $G(A) \le 1$.

Consider the behavior of $G(A)$ on successive passes of the program through add.

$$A^* \cdot A_{t+1} = A^* \cdot (A_t + \Phi) = A^* \cdot A_t + A^* \cdot \Phi > A^* \cdot A_t + \delta;$$

hence after the $n$th application of add we obtain

$$A^* \cdot A_n > n\delta. \qquad \text{(THESIS)}$$

Thus the numerator of $G(A)$ increases linearly with the number of changes of $A$, that is, the number of errors.

As for the denominator, since $A_t \cdot \Phi$ cannot be positive (or the program would not have gone through add)

$$|A_{t+1}|^2 = A_{t+1} \cdot A_{t+1} = (A_t + \Phi) \cdot (A_t + \Phi) = |A_t|^2 + 2 A_t \cdot \Phi + |\Phi|^2 \le |A_t|^2 + 1,$$

and after the $n$th application of add,

$$|A_n|^2 \le n. \qquad \text{(ANTITHESIS)}$$

Combining the results thesis and antithesis, we obtain

$$G(A_n) = \frac{A^* \cdot A_n}{|A_n|} > \frac{n\delta}{\sqrt{n}} = \sqrt{n}\,\delta.$$

But $G(A) \le 1$, so this can continue only so long as $\sqrt{n}\,\delta \le 1$, that is, $n \le 1/\delta^2$. This completes the proof.

Figure 11.2 The radial increase must be at least $\delta$, yet the new vector must remain in the shaded region; this becomes impossible when the region, whose thickness varies inversely with $|A|$, becomes thinner than $\delta$.

Figures 11.2 and 11.3 show some aspects of the geometry of the rate of growth of $|A|$. They are particularly interesting if one wishes to look at the algebraic proof in the following dialectical and slightly perverse form. Inequality antithesis can be read as saying that $|A_n|$ increases more slowly than the square root of $n$. On the other hand Inequality thesis can be turned (via the Cauchy-Schwarz inequality) into an assertion that $|A_n|$ grows linearly with $n$. This leads to a contradiction: $|A_n|$ must grow, but cannot grow fast enough.

Figure 11.3 The extreme case in which the bound $|A_n| = \sqrt{n}$ is obtained.

11.3 A Geometric Proof (Optional)

We are given a (unit) vector $A^*$ with the property $A^* \cdot \Phi > \delta$ for all $\Phi \in F$.
This means that every vector $\Phi$ in $F$ makes an angle $\theta_\Phi$ with $A^*$ for which $\cos \theta_\Phi > \delta$. If we choose $\theta^* > 0$ to be smaller than any of the complementary angles $\pi/2 - \theta_\Phi$, then every vector $V$ within $\theta^*$ of $A^*$ has the property $V \cdot \Phi > 0$ for all $\Phi \in F$. Therefore any vector $V$ within the circular cone with base angle $\theta^*$ from $A^*$ will be a solution vector that will cause the program to stop changing.

Now consider the vector $A$ computed within the program. At each stage $A$ is a sum of members of $F$. Thus

$$A^* \cdot A = A^* \cdot (\Phi_1 + \Phi_2 + \cdots) > 0.$$

Let this page represent the plane containing $A^*$ and $A$. If we take $A^*$ as a unit vector oriented vertically, the above inequality shows that $A$ must be oriented into the upper half-plane.

We should like to show that each time the program passes through add, $A$ is brought closer in direction to $A^*$. Unfortunately, this is not strictly true. Figure 11.4 shows, however, that it will “normally” happen; and we should understand this normal case before closing off the details to obtain a rigorous proof.

When add is used a vector $\Phi$ will be added to the current value of $A$, say $A_t$, to obtain a new value of $A$, say $A_{t+1} = A_t + \Phi$. We know two facts about $\Phi$:

$$A^* \cdot \Phi > 0, \qquad A_t \cdot \Phi \le 0.$$

Now consider the projection of $\Phi$ on the plane of the paper and placed with its origin at the end of $A_t$ (in preparation for the usual geometric picture of vector addition). The first condition states that the end of $\Phi$ must be above the dotted line and the second condition states that it must be below the dashed line. Thus, it lies as shown and points from the end of $A_t$ towards the direction of $A^*$. If we consider the right cone generated by rotating $A_t$ about $A^*$, it is clear that $\Phi$ itself (of which the vector drawn in the plane is the projection) runs into the cone. The proof of the theorem would be complete except for the observation that $\Phi$ might leave the cone again and so allow $A_{t+1}$ to have a larger angular separation from $A^*$ than did $A_t$. Figure 11.5 shows how this might happen.
Figure 11.5

But the overshoot phenomenon is not fatal, for it can occur only a limited number of times, depending on $\theta^*$. To prove this consider the cone generated by rotating $A$ about $A^*$. Because $\Phi$ always has a vertical component $\Phi \cdot A^* > \delta$, the height of the cone increases each time $A$ is changed. If the angle between $A$ and $A^*$ remains greater than $\theta^*$ (and if not, the proof is finished!), the rim of the cone will come to have indefinitely large radii.

Now let us look down, along $A^*$, at the projection $\bar{\Phi}$ of $\Phi$ on the top of the cone: we will show that the end of $\bar{\Phi}$ must lie at least a distance $d$ toward $A^*$ from the tangent line. Also, since $|\Phi| = 1$, the end of $\bar{\Phi}$ must lie inside a unit circle drawn around the end of $A$ (see Figure 11.6). Thus the end of $\bar{\Phi}$ must lie in the shaded region. When the cone rim gets large enough, the shaded region will lie entirely within the cone, and so will $\bar{\Phi}$, and therefore also the end of $\Phi$ which is directly above it. So it remains only to show where the magic distance $d$ comes from.

To see this, we now look along the line tangent to the cone-rim through $A$ (see Figure 11.7). Now the end of $\Phi$ must lie within the shaded region defined by (1) the plane orthogonal to $A$, and (2) a plane orthogonal to $A^*$ and lying $\delta$ above $A$, again because $A^* \cdot \Phi > \delta$.

Figure 11.7

Thus, the end of $\Phi$ cannot come closer than the indicated distance $d$ to the tangent. Because $A$ is a sum of $\Phi$'s it cannot ever make an angle greater than $\pi/2 - \theta^*$ with $A^*$, and this gives a positive lower bound to $d$. So, after a finite “transient” period, the $A$'s remain in a “vertical” cylinder, which must eventually go inside the acceptance cone around $A^*$. This proves that, eventually, $A$ must stop changing. Thus Theorem 11.1 is proved.

11.4 Variants of the Convergence Theorem

The convergence theorem has a large number of minor variants.
It is easy to adapt our proof to cover any of the following forms in which it occurs in the literature on perceptrons:

(1) Instead of assuming $F$ to consist of unit vectors one can assume it to be a finite set, or to satisfy an upper and lower bound on length, that is, there exist $a, b$ such that $0 < a < |\Phi| < b$ for all $\Phi \in F$.

(2) Instead of replacing $A$ by $A + \Phi$ one can replace it by $A + k\Phi$, where $k$ is a real number chosen by any one of the rules:

  $k$ is a constant $> 0$.
  $k = 1/|\Phi|$, that is, add a unit vector.
  $k = c\,|A \cdot \Phi| / |\Phi|^2$. If $c = 1$ then $k$ is just enough to bring $(A + k\Phi) \cdot \Phi$ out of the negative range. Or, one can use any value for $c$ between 0 and 2. [Agmon, 1954]

These and similar modifications do not change the theorem in the sense that $A$ will still approach a solution vector, although the actual number of transfers to add will be altered. It would also be interesting to compare the relative efficiency of the “local” perceptron convergence program with more “global” analytic methods, for example, linear programming, for solving the system of inequalities in $A$:

$$A \cdot \Phi > 0, \quad \text{all } \Phi \text{ in } F.$$

11.4.1 More Than Two Classes

A more substantial variation is obtained by allowing more than two classes of input figures. Let $F_1, F_2, \ldots$ be sets of figures and suppose that there are vectors $A_i^*$ and $\delta > 0$ such that $\Phi \in F_i$ implies that for all $j \ne i$,

$$A_i^* \cdot \Phi > A_j^* \cdot \Phi + \delta.$$

The perceptron convergence theorem generalized to this case assures us that vectors with the same property can be found by following the usual principle of feedback: whenever one runs into a figure $\Phi$ in $F_i$ for which $A_i \cdot \Phi < A_j \cdot \Phi$ for some $j$, $A_i$ must be “increased” and $A_j$ “decreased.”

This idea is expressed more precisely in the program:

  start:   Choose any nonzero values for $A_1, A_2, \ldots$.
  test:    Choose $i$, $j$, and $\Phi \in F_i$. If $A_i \cdot \Phi > A_j \cdot \Phi$ go to test; otherwise go to change.
  change:  Replace $A_i$ by $A_i + \Phi$.
           Replace $A_j$ by $A_j - \Phi$. Go to test.

The generalized theorem states that the program will go to change only a finite number of times. But this is possible only if the machine eventually stops making mistakes, that is, eventually every $\Phi$ in $F_i$ will make $A_i \cdot \Phi > A_j \cdot \Phi$, for all $j \ne i$.

To prove this, let $A_1^*, \ldots, A_i^*, \ldots, A_j^*, \ldots, A_m^*$ have the required property, and define $A^*$ to be the vector (in a larger space) defined by stringing together all their coefficients. Also, for each $\Phi \in F_i$, define $\Phi_{ij}$ to be the vector that contains $\Phi$ and $-\Phi$ in the $i$th and $j$th blocks, with zeros elsewhere. Apply Theorem 11.1 to this large space.

11.5 Application: Learning the Parity Predicate $\psi_{\text{PARITY}}$

As an example to illustrate the convergence theorem, we shall estimate the number of steps required to learn the coefficients for the parity predicate. We have shown in §10.1 that the solution vector with the smallest coefficients can be written

$$A = \big(2^{|R|},\ \underbrace{2^{|R|-1}, \ldots, 2^{|R|-1}}_{|R| \text{ terms}},\ \ldots,\ \underbrace{2^{i}, \ldots, 2^{i}}_{\binom{|R|}{i} \text{ terms}},\ \ldots,\ 1\big).$$

The length of this vector is given by

$$|A|^2 = \sum_{i=0}^{|R|} \binom{|R|}{i} 2^{2i} = 5^{|R|}.$$

The corresponding unit vector is then

$$A^* = \frac{A}{5^{|R|/2}}.$$

The analysis of §10.1 shows that $A \cdot \Phi$ is $1$ or $-1$. Since $\Phi$ has $2^{|R|}$ coefficients, each 0 or 1, we have $|\Phi| \le 2^{|R|/2}$ and hence

$$|A^* \cdot \hat{\Phi}| = \frac{|A \cdot \Phi|}{|A|\,|\Phi|} \ge \frac{1}{5^{|R|/2}\, 2^{|R|/2}} = \frac{1}{\sqrt{10^{|R|}}}.$$

So we can take $1/\sqrt{10^{|R|}}$ as $\delta$. The number $n$ of corrections is then bounded by

$$n \le \frac{1}{\delta^2} \le 10^{|R|}.$$

We obtain a lower bound of $5^{|R|}$ for $n$ by observing that $|A_n|$ must be at least $5^{|R|/2}$ and that $|A_n|^2 \le n$. Combining these we have

$$5^{|R|} \le n \le 10^{|R|}.$$

It is worth observing that if the convergence program had added $\Phi$ instead of $\hat{\Phi}$ we would have obtained

$$\frac{5^{|R|}}{\max |\Phi|^2} \le n \le 10^{|R|}, \quad \text{that is,} \quad \left(\tfrac{5}{2}\right)^{|R|} \le n \le 10^{|R|}.$$

More analysis would be necessary to decide whether this modification would actually result in more rapid learning. In any case it is clear that the learning time must increase exponentially as a function of $|R|$.

These inequalities give bounds on the number $n$ of corrections or, what comes to the same thing, of errors.
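The parity bounds are easy to tabulate for small retinas. The sketch below is only an illustration: it assumes the coefficient magnitudes quoted from §10.1 (the value $2^i$ repeated $\binom{|R|}{i}$ times) and the worst case $|\Phi|^2 = 2^{|R|}$, and the function name `parity_bounds` is hypothetical.

```python
from math import comb


def parity_bounds(r):
    """Return (lower, upper) bounds on the number of corrections for |R| = r."""
    # |A|^2 = sum_i C(r, i) * (2^i)^2, which the binomial theorem gives as 5^r.
    norm_sq = sum(comb(r, i) * 4**i for i in range(r + 1))
    max_phi_sq = 2**r              # Phi has 2^r components, each 0 or 1
    lower = norm_sq                # from |A_n|^2 <= n and |A_n| >= |A|
    upper = norm_sq * max_phi_sq   # 1 / delta^2 with delta^2 = 1 / (|A|^2 |Phi|^2)
    return lower, upper


for r in range(1, 6):
    low, high = parity_bounds(r)
    print(f"|R| = {r}: {low} <= n <= {high}")
```

Both columns grow exponentially with $|R|$, which is the point of the estimate.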
A calculation of the total number of rounds of the program must take account of the decreasing error rate as learning proceeds. It is, however, easy to see that the number $M(r)$ of rounds needed to reduce the proportion of errors to a fraction $r$ should satisfy the inequality $M(r) \le n/r$ on the assumption that the figures are presented to the machine in random order. Thus it should take something less than $10^{|R|+2}$ rounds to achieve a 1 percent error rate.

11.6 The Convergence Procedure as Hill-Climbing

It is instructive to examine the relation of the convergence procedure to the general problem of “hill-climbing.” There, too, one tries to find an apparently globally defined solution (that is, the location of the absolute summit) by local operations (for example, steepest ascent). Success depends, however, on the extent that the summit is not as globally defined as it might appear. In cases where the hill has a complex form with many local relative peaks, ridges, etc., hill-climbing procedures are not always advantageous. Indeed, in extreme cases a random or systematic search might be better than a procedure that relentlessly climbs every little hillock.

In a typical hill-climbing situation one tries to maximize a function $G(A)$ of a point $A$ in $n$-dimensional space. The simplest procedure computes the value of the “altitude” function $G$ for a number of points $A_t + \Phi_i$ in the vicinity of the current point $A_t$. On the basis of these experiments, a value $\Phi$ is chosen and $A_t + \Phi$ is taken as $A_{t+1}$. The algorithm for the choice of $\Phi$ varies. It might, for example, use unit vectors in the directions of the axes as the $\Phi_i$, compute the direction of steepest ascent and take $\Phi$ as the unit vector in this direction. A simpler procedure might take as $\Phi$ the first unit vector it finds with the property that $G(A_t + \Phi) > G(A_t)$. The choice of the appropriate algorithm will depend on many considerations.
If, however, the hill (that is, the surface defined by $G$) is sufficiently well behaved, any reasonably sophisticated algorithm will work. If the hill is very bad, no ingenious local tricks will do much good. See Figure 11.8.

Figure 11.8 A good hill with a bad algorithm (example suggested by Oliver Selfridge). Hill-climbing along the two axes won’t work, for if $A$ is a point on the ridge, both $G(A + \Phi_1)$ and $G(A + \Phi_2)$ are less than $G(A)$. The “resolution” of the test vectors is too coarse for the sharpness of the ridge.

Now the perceptron convergence procedure can be seen as a hill-climbing algorithm if we define the surface $G$ by

$$G(A) = \frac{A^* \cdot A}{|A|}.$$

It differs from the usual form in two superficial respects. First, the algorithm has no procedure for systematically exploring the effects of moving in all directions from the current point $A_t$. Second, it never actually has the value of the object function $G(A)$ since $A^*$ is, by definition, unknown. Nevertheless, the logic of its operation is essentially like the simpler of the two hill-climbing algorithms mentioned in the previous paragraph: the step from $A_t$ to $A_{t+1} = A_t + \Phi$ is based on evidence indicating (albeit indirectly) that $G(A_{t+1})$ is larger than $G(A_t)$. One would expect its success to be related to the shape of the surface $G(A)$. And, indeed, a little thought will show that this surface has none of the pathological features likely to make hill-climbing difficult: there are no false local maxima, no ridges, no plateaus, etc. This is most easily seen by considering the function $G(\hat{A})$ on the unit sphere, where $\hat{A} = A/|A|$. For $\hat{A}$ satisfying $\hat{A} \cdot A^* > 0$ (the only ones we need consider) this surface is an $n$-dimensional cone. It has a single peak at $\hat{A} = A^*$, connected uniform contours, straight lines of steepest ascent; in short, all features a hill-climber could desire.
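To make the altitude function concrete, the sketch below runs the unit-vector program of Theorem 11.1 on a small set and records $G(A) = A^* \cdot A / |A|$ after every correction, together with the two quantities bounded in §11.2. Everything specific here is an assumption for illustration: the names `unit` and `climb`, the cyclic presentation order, the example set `F`, and the choice of `A_STAR` (which the program itself never sees). The counter `n` counts the vectors of $F$ summed into $A$, including the starting vector, so the inequalities read $A^* \cdot A \ge n\delta$ and $|A|^2 \le n$.

```python
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = sqrt(dot(v, v))
    return tuple(x / norm for x in v)


def climb(F, a_star, max_passes=1000):
    """Run start/test/add and record (n, A* . A, |A|^2, G(A)) after each change."""
    a = list(F[0])                                   # start: A is some Phi of F
    n = 1
    history = [(n, dot(a_star, a), dot(a, a), dot(a_star, a) / sqrt(dot(a, a)))]
    for _ in range(max_passes):
        changed = False
        for phi in F:                                # test
            if dot(a, phi) <= 0:                     # add
                a = [ai + pi for ai, pi in zip(a, phi)]
                n += 1
                history.append((n, dot(a_star, a), dot(a, a),
                                dot(a_star, a) / sqrt(dot(a, a))))
                changed = True
        if not changed:
            return a, history
    raise RuntimeError("did not converge within max_passes")


# Hypothetical data: unit vectors of F+ together with negated unit vectors of F-.
F = [unit(v) for v in [(1, 1, 0), (1, 0, 1), (1, 1, 1), (-1, 0, 0)]]
A_STAR = unit((-1, 2, 2))                            # one known solution vector
DELTA = min(dot(A_STAR, phi) for phi in F)
A, HISTORY = climb(F, A_STAR)
for n, numerator, norm_sq, g in HISTORY:
    print(n, round(numerator, 3), round(norm_sq, 3), round(g, 3))
```

The printed rows let one watch the numerator grow at least linearly while $|A|^2$ grows at most linearly, which is what forces the number of corrections to stay below $1/\delta^2$.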
Thus, we see, from another point of view, that the convergence theorem is neither as surprising nor as isolated a phenomenon as it might at first appear.

11.7 Perceptrons and Homeostats

The significance of the perceptron convergence theorem must not be reduced (as it often has been, in the literature) to the mere statement: If two sets of figures are linearly separable then the convergence theorem procedure can find a separating predicate. For if all one wanted is to find a separating predicate, a more trivial procedure would suffice.

Observe first that if there exists a vector $A^*$ such that $A^* \cdot \Phi > \delta > 0$ for all $\Phi \in F$, then there exists a vector $A'$ with the same property and with integer components. We can therefore find a suitable $A'$ by the simple program:

  start:     Set $A = 0$.
  test:      Choose $\Phi \in F$. If $A \cdot \Phi > 0$ go to test; otherwise go to generate.
  generate:  Replace $A$ by $T(A)$ where $T$ is any transformation such that the series $T(0), T(T(0)), T(T(T(0))), \ldots$ includes all possible integral vectors. Go to test.

Clearly, the procedure can make but a finite number of errors before it hits upon a solution. It would be hard to justify the term “learning” for a machine that so relentlessly ignores its experience. The content of the perceptron convergence theorem must be that it yields a better learning procedure than this simple homeostat. Yet the problem of relative speeds of learning of perceptrons and other devices has been almost entirely neglected. There is not yet any general theory of this topic; in §11.5 we discussed some of the problems encountered in estimating learning times.

Some other simple methods of “learning” will be discussed in Chapter 12. The logical theory of homeostats, that is, enumerative procedures like the one mentioned just above, is discussed by W. Ross Ashby in the book Design for a Brain.
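For comparison with the convergence procedure, the enumerative start/test/generate program can be sketched as follows. This is an illustration under assumptions rather than a construction from the text: the transformation $T$ is realized by a generator, `integer_vectors`, that walks through all integer vectors in shells of increasing largest component, and the names `homeostat` and `max_errors` as well as the example set `F` are hypothetical.

```python
from itertools import count, product


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def integer_vectors(dim):
    """Yield every integer vector of length dim exactly once, shell by shell."""
    yield (0,) * dim
    for radius in count(1):
        for v in product(range(-radius, radius + 1), repeat=dim):
            if max(abs(c) for c in v) == radius:     # new in this shell
                yield v


def homeostat(F, max_errors=100_000):
    """start/test/generate: on each error, move on to the next integer vector."""
    candidates = integer_vectors(len(F[0]))
    a = next(candidates)                             # start: A = 0
    errors = 0
    while errors < max_errors:
        for phi in F:                                # test
            if dot(a, phi) <= 0:
                a = next(candidates)                 # generate: A <- T(A)
                errors += 1
                break
        else:
            return a, errors                         # no Phi in F fails
    raise RuntimeError("no solution found within max_errors")


# Hypothetical F: vectors of F+ together with the negatives of those of F-.
F = [(1, 1, 0), (1, 0, 1), (1, 1, 1), (-1, 0, 0)]
A, ERRORS = homeostat(F)
```

The new candidate never depends on which figure caused the error, which is the sense in which such a machine ignores its experience.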
11.8 The Nonseparable Case

There are many reasons for studying the operation of the perceptron learning program when there is no $A^*$ with the property $A^* \cdot \Phi > 0$ for all $\Phi \in F$. Some of these are practical reasons. For example, one might want to use the program to test whether such an $A^*$ exists, or one might wish to make a learning machine of this sort and be worried about the possible effects of feedback errors and other “noise.” Other motives are theoretical. One cannot claim to have completely understood the “separable case” without at least some broader knowledge of other cases.

Now it is perfectly obvious that Theorem 11.1 cannot be true, as it stands, under these more general conditions. It must be possible for $A$ to change infinitely often. However, the fate of $A$ is not obvious: will $|A|$ increase indefinitely? Will $A$ take infinitely many values or will it cycle through or in some other way remain in some fixed finite set?

In the next sections we shall prove that $|A|$ remains bounded. To be more precise we introduce the following definitions: Let $F$ be a finite set of vectors. Then

An $F$-chain is a sequence of vectors $A_1, A_2, \ldots, A_n$, for which

$$A_{i+1} = A_i + \Phi_i, \qquad \Phi_i \cdot A_i \le 0, \qquad \Phi_i \in F.$$

An $F$-chain is proper if, for all $i$, $|A_i| \ge |A_1|$.

We will prove that $F$-chains beginning with large vectors cannot grow much larger.

11.9 The Perceptron Cycling Theorem

For any $\epsilon > 0$ there is a number $N = N(\epsilon, F)$ such that if $A, \ldots, A'$ is a proper $F$-chain and $|A| > N$, then $|A'| < |A| + \epsilon$.

Corollary 1: The lengths $|A|$ of vectors obtainable by executing the learning program, with a given $F$ and a given initial vector, are bounded. If the finite set of vectors in $F$ are constrained to have integer coordinates, then the process is finite-state!

The plausibility of these assertions is easily verified by examining Figure 11.10.
As $|A|$ grows it becomes increasingly difficult to find a member of $F$ satisfying both $A \cdot \Phi \le 0$ and $|A + \Phi| > |A|$. The formal proof is given in §11.10 and uses induction on the dimension of the vectors in $F$.

Our proof of this theorem is complicated and obscure. So are the other proofs we have since seen. Surely someone will find a simpler approach. Nonspecialists may wish to skip it.

The theorem (in the form of Corollary 1) was apparently first conjectured by Nilsson, and proved by Efron. Terry Beyer formulated the conjecture quite independently.

11.10 Proof of the Cycling Theorem

The proof depends on some observations about the effect, on the length of an arbitrarily long vector $A$, of adding to it a short vector $C$ whose length is fixed.

11.10.1 Lemmas*

If $C$ is any vector, and $A$ is very large compared to $C$, then

$$|A + C| - |A| \approx \hat{A} \cdot C.$$

More precisely, define $\Delta = |A + C| - |A|$. Then for any $\epsilon > 0$, if we choose $|A| > |C|^2/\epsilon$ then the difference between $\Delta$ and $\hat{A} \cdot C$ will be less than $\epsilon$. It is easy to read from the infinitesimal geometry of Figure 11.9 that

$$|\hat{A} \cdot C - \Delta| < |B| \sin\theta \approx |B|^2/|A| < |C|^2/|A|, \quad \text{when } |A| \gg |C|.$$

A formal proof is hardly necessary, but if we define $x = |A + C|$ and $y = |A|$ we can use the identity $x^2 - y^2 = 2y(x - y) + (x - y)^2$ to obtain

$$2 A \cdot C + |C|^2 = 2|A|\Delta + \Delta^2;$$

hence

$$2|A|(\hat{A} \cdot C - \Delta) = \Delta^2 - |C|^2.$$

Since $|\Delta| \le |C|$ we then have

$$|\hat{A} \cdot C - \Delta| \le |C|^2/|A|.$$

Since this shows that $\Delta \approx \hat{A} \cdot C$ when $|A| \gg |C|$ we can conclude that

Lemma 1: We can make $\Delta$ as small as we want by setting a lower bound on $|A|$ and an upper bound on $\hat{A} \cdot C$, that is, by taking $A$ large, and nearly orthogonal to $C$.

Lemma 2: We can make the angle $(A, A + C)$ as small as we like by increasing $|A|$ because $\sin\theta \le |C|/|A|$.

*We denote by $\hat{A}$ the unit length vector along the direction of $A$.
Lemma 3: If a relatively small vector $C$ is bounded away from orthogonal to a very large vector $A$, with negative inner product, then $\Delta$ is bounded away from (negative) zero. In fact, if $\hat{A} \cdot \hat{C} < -\delta < 0$, then if we take $|A| > (2/\delta)\,|C|$ we have (because $\Delta$ approaches the negative quantity $\hat{A} \cdot \hat{C}\,|C|$)
$$\hat{A} \cdot \hat{C}\,|C| \le \Delta < \tfrac{1}{2}\,\hat{A} \cdot \hat{C}\,|C| < 0.$$
Thus $\Delta < -\tfrac{1}{2}\,\delta\,|C|$.

[Figure 11.10]

Finally, we need one more substantial lemma:

Lemma 4: The projection of a proper F-chain $A_1, \ldots, A_k$ onto a hyperplane containing $F$ is a proper F-chain. Moreover, the increase in length, $|A_k| - |A_1|$, is not greater than that of the projected chain.

proof: Let $A_1, \ldots, A_k$ be a proper chain. Let $H$ be a hyperplane containing $F$, and $\hat{B}$ the normal to $H$. Remember that $\hat{B} \cdot \Phi = 0$ for all $\Phi \in F$. Write
$$A_i = \bar{A}_i + x_i \hat{B},$$
where $\bar{A}_i$ is the projection of $A_i$ onto $H$. To show that $\bar{A}_1, \ldots, \bar{A}_k$ is an F-chain, let $A_{i+1} = A_i + \Phi$, where $A_i \cdot \Phi < 0$. Now
$$A_{i+1} = \bar{A}_{i+1} + x_{i+1}\hat{B} = (\bar{A}_i + \Phi) + x_i \hat{B}.$$
Then, by the orthogonality of $\hat{B}$ to all of $\bar{A}_i$, $\bar{A}_{i+1}$, and $\Phi$,
$$x_{i+1} = x_i \quad \text{and} \quad \bar{A}_{i+1} = \bar{A}_i + \Phi.$$
Finally, putting $B = x_i \hat{B}$ we get
$$0 > A_i \cdot \Phi = (\bar{A}_i + B) \cdot \Phi = \bar{A}_i \cdot \Phi + B \cdot \Phi = \bar{A}_i \cdot \Phi.$$
To show that the $\bar{A}$'s form a proper F-chain we must also verify that $|\bar{A}_i| \ge |\bar{A}_1|$. This follows immediately from
$$|A_i|^2 = |\bar{A}_i|^2 + 2\bar{A}_i \cdot B + |B|^2 = |\bar{A}_i|^2 + |B|^2.$$
Finally it is easy to see that
$$|A_k| - |A_1| = \sqrt{|\bar{A}_k|^2 + |B|^2} - \sqrt{|\bar{A}_1|^2 + |B|^2} \le |\bar{A}_k| - |\bar{A}_1|,$$
so the latter must be positive. Q.E.D.

11.10.2 Proof of Cycling Theorem

We prove the theorem by induction on the dimension of the vector space.

base: The theorem is obviously true in $E_1$, the one-dimensional case, for there the vectors are simply real numbers and $\Phi \cdot A < 0$ means that $\Phi$ and $A$ have opposite signs. If $|A| > \max_F |\Phi|$ then $|\Phi + A| < |A|$ for $\Phi \cdot A < 0$. So eventually $|A_i|$ must be less than $\max_F |\Phi|$.

induction: Assume for an inductive proof that the theorem is true in $E_{n-1}$. Note that this implies the existence of a bound $M_{n-1}$ such that any F-chain $A_1, \ldots$
, $A_m$ in $E_{n-1}$ can grow at most $M_{n-1}$ in length, that is, $|A_m| < |A_1| + M_{n-1}$.

Choose any direction (that is, unit vector) $\hat{A}$ in $E_n$. Our first subgoal will be to construct an open neighborhood $V(\hat{A})$ on the unit sphere from which the growth of chains is bounded; in fact, for any $\epsilon > 0$, there is a bound $N(\hat{A})$ such that, if $|B| > N(\hat{A})$ and $\hat{B} \in V(\hat{A})$, then any proper F-chain starting at $B$ can grow at most $\epsilon$ in length. Since the open sets $V(\hat{A})$ cover the surface of the unit sphere, and since the sphere is compact, it will follow that we can find one $N$ that will work in place of all the $N(\hat{A})$'s and the theorem will be proved.

Let $H(\hat{A})$ be the hyperplane orthogonal to $\hat{A}$ and let $\bar{H}(\hat{A})$ be the complement of $H(\hat{A})$, that is, $\bar{H}(\hat{A}) = E_n - H(\hat{A})$. Since $F$ is finite, there is a number $\delta > 0$ such that $|\Phi \cdot \hat{A}| > 2\delta$ for all $\Phi$ in $\bar{H}(\hat{A}) \cap F$. By continuity there is a neighborhood $V'(\hat{A})$ such that if $\hat{B} \in V'(\hat{A})$ and $\Phi \in \bar{H}(\hat{A}) \cap F$ then $|\Phi \cdot \hat{B}| > \delta$. There is also a number $b$ such that $|\Phi| < b$ for all $\Phi \in F$. We can now deduce from Lemma 3 that there are numbers $\delta'$ and $n(\hat{A})$ such that if

(1) $|B| \ge n(\hat{A})$
(2) $\Phi \in \bar{H}(\hat{A}) \cap F$
(3) $\hat{B} \in V'(\hat{A})$
(4) $\Phi \cdot B < 0$

then

(5) $|B + \Phi| < |B| - \delta'$.

These are the conditions of Lemma 3, where $B$ and $\Phi$ play the role of $A$ and $C$ of the lemma. Note that (2) keeps $\Phi$ from being perpendicular to $\hat{A}$ and (4) keeps $\Phi$ from being perpendicular to $B$.

We shall consider a proper F-chain $B_1, \ldots, B_j, \ldots$ with $B_{j+1} = B_j + \Phi_j$, with $\hat{B}_1$ very near $\hat{A}$ and $|B_1| > n(\hat{A})$. In particular, let $\eta > 0$ be a number such that the diameter of $V'(\hat{A})$ is bigger than $\eta$. Let $V(\hat{A})$ be a neighborhood of $\hat{A}$ on the unit sphere such that the diameter of $V(\hat{A})$ is less than $\eta/2$. So $V(\hat{A}) \subset V'(\hat{A})$. We now take $B_1$ such that $\hat{B}_1 \in V(\hat{A})$ and $|B_1| > n(\hat{A})$, though we will shortly change this lower bound on the magnitude of $B_1$ to the desired $N(\hat{A})$. By (5) above, the chain cannot be proper unless $\Phi_1 \in H(\hat{A})$. Thus the chain must start growing from $H(\hat{A})$.
We will see that not only $\Phi_1$, but all the other $\Phi$'s must be in $H(\hat{A})$; hence the chain's growth is bounded by $\epsilon$. For suppose that $\Phi_1, \ldots, \Phi_j \in H(\hat{A})$ and $\Phi_{j+1} \in \bar{H}(\hat{A})$. Then $|B_{j+1}|$ will be less than $|B_1|$ by at least $\delta'/2$. To see this we use Lemmas 1 and 2 and the inductive assumption. Since the projections $\bar{B}_1, \ldots, \bar{B}_j$ of $B_1, \ldots, B_j$ form a proper F-chain in the $(n - 1)$-dimensional space $H(\hat{A})$,
$$|\bar{B}_j| < |\bar{B}_1| + M_{n-1},$$
where $M_{n-1}$ is the bound obtained by the induction assumption for the next lower dimension. Now, if $\eta$ is chosen small enough and if $N(\hat{A})$ is chosen large enough, the conditions of Lemmas 1 and 2 are satisfied with the following values: we use $\Phi_1 + \cdots + \Phi_j$ for “$C$,” so that $|C| < M_{n-1}$; we use $B_1$ for “$A$”; and we use a smaller $\delta' = \min(\epsilon, \delta'/2)$ for “$\epsilon$.” It follows from (5) and the fact that $|B_j| \ge |B_1| > N(\hat{A})$ that
$$|B_{j+1}| < |B_j| - \delta' < |B_1| - \delta'/2,$$
so that the jump from $B_j$ to $B_{j+1}$ decreased the length of the B-vector more than the first $j$ steps increased it! Thus the chain cannot be proper unless all the $\Phi$'s belong to $H(\hat{A})$. But in this case the increase in length of the whole chain is bounded by $\epsilon$. This achieves the first subgoal.

The surface of the sphere is covered by the $V(\hat{A})$'s. By compactness, there is a finite subcovering. Let $N$ be the maximum of the corresponding $N(\hat{A})$'s. It follows that for any proper chain $B, \ldots, B'$,
$$|B| > N \implies |B'| < |B| + \epsilon.$$
This completes the proof of the cycling theorem.

** [Handwritten note by the authors: There is a gap here, pointed out [...] by H. D. Block and S. A. Levin, Proc. Amer. Math. Soc. [...].]

12 Linear Separation and Learning

12.0 Introduction

The perceptron and the convergence theorems of Chapter 11 are related to many other procedures that are studied in an extensive and disorderly literature under such titles as LEARNING MACHINES, MODELS OF LEARNING, INFORMATION RETRIEVAL, STATISTICAL DECISION THEORY, PATTERN RECOGNITION and many more.
In this chapter we will study a few of these to indicate points of contact with the perceptron and to reveal deep differences. We can give neither a fully rigorous account nor a unifying theory of these topics: this would go as far beyond our knowledge as beyond the scope of this book. The chapter is written more in the spirit of inciting students to research than of offering solutions to problems.

12.1 Information Retrieval and Inductive Inference

The perceptron training procedures (Chapter 11) could be used to construct a device that operates within the following pattern of behavior:

[Figure: diagram of this pattern of behavior, with an output labeled “Answers”]

During a “filing” phase, the machine is shown a “data set” of $n$-dimensional vectors (one can think of them as $n$-bit binary numbers or as points in $n$-space). Later, in a “finding” phase, the machine must be able to decide which of a variety of “query” vectors belong to the data set. To generalize this pattern we will use the term $A_{\text{file}}$ for an algorithm that examines elements of a data set to modify the information in a memory. $A_{\text{file}}$ is designed to prepare the memory for use by another procedure, $A_{\text{find}}$, which will use the information in the memory to make decisions about query vectors.

This chapter will survey a variety of instances of this general scheme. We will begin by relating the PERCEPTRON procedure to the simplest such scheme: in the COMPLETE STORAGE procedure $A_{\text{file}}$ merely copies the data vectors, as they come, into the memory. For each query vector, $A_{\text{find}}$ searches exhaustively through memory to see if it is recorded there.

12.1.1 Comparing PERCEPTRON with COMPLETE STORAGE

Our purpose is to illustrate, in this simple case, some of the questions one might ask to compare retrieval schemes:

Is the procedure universal? The PERCEPTRON scheme works perfectly only under the restriction that the data set is linearly separable. COMPLETE STORAGE is universal: it works for any data set.

How much memory is required?
COMPLETE STORAGE needs a memory large enough to hold the full data set. PERCEPTRON, when it is applicable, sometimes has a summarizing effect, in that the information capacity needed to store its coefficients $\{\alpha_i\}$ is substantially less than that needed to store the whole data set. We have seen (§10.2) that this is not generally true; the coefficients for $\psi_{\text{PARITY}}$ may need much more storage than does the list of accepted vectors.

How quickly does $A_{\text{find}}$ operate? The retrieval scheme (exhaustive search) specified for COMPLETE STORAGE is very slow (usually slower than PERCEPTRON's $A_{\text{find}}$, which must also retrieve all its coefficients from memory). On the other hand, very similar procedures could be much faster. For example, if $A_{\text{file}}$ did not just store the data set in its order of entry, but sorted the memory into numerical order, then $A_{\text{find}}$ could use a binary search, reducing the query-answer time to $\log_2(|\text{data set}|)$ memory references. We shall study (in §12.6) $A_{\text{file}}$ algorithms that sacrifice memory size to obtain dramatic further increases in speed (by the so-called hash-coding technique).

Can the procedure operate with some degree (perhaps measured probabilistically) of success even when $A_{\text{file}}$ has seen only a subset of the data set (call it a “data sample”)? PERCEPTRON might, but the COMPLETE STORAGE algorithm, as described, cannot make a reasonable guess when presented with a query vector not in the data sample. This deficiency suggests an important modification of the COMPLETE STORAGE procedure: let $A_{\text{find}}$, instead of merely checking whether the query vector is in the data sample, find that member of the data sample closest to it. This would lead, on an a priori assumption about the “continuity” of the data set, to a degree of generalization as good as the perceptron's.
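The COMPLETE STORAGE variants just discussed can be sketched in a few lines. This is an illustration, not the authors' program: the function names, the use of tuples for vectors, and the sample data are all assumptions made here. `a_file` sorts the memory so that `a_find` can use binary search, and `a_find_nearest` is the closest-member modification, implemented as a plain linear scan.

```python
from bisect import bisect_left

def a_file(data_set):
    """Complete storage: keep every vector, sorted so lookup can bisect."""
    return sorted(set(data_set))

def a_find(memory, query):
    """Exact membership by binary search: about log2(len(memory)) probes."""
    i = bisect_left(memory, query)
    return i < len(memory) and memory[i] == query

def a_find_nearest(memory, query):
    """Modified A_find: return the stored vector closest to the query.

    Squared Euclidean distance with a linear scan, so the speed
    advantage of binary search is given up.
    """
    return min(memory,
               key=lambda v: sum((x - y) ** 2 for x, y in zip(v, query)))

# Hypothetical data sample of 3-bit vectors.
memory = a_file([(1, 1, 1), (0, 0, 0), (0, 0, 1)])
print(a_find(memory, (0, 0, 1)))          # a vector that was filed
print(a_find(memory, (1, 1, 0)))          # a vector never shown to a_file
print(a_find_nearest(memory, (1, 1, 0)))  # closest filed vector
```

The exact-match `a_find` can only report that an unseen query is absent, while `a_find_nearest` still returns a candidate, which is the kind of generalization the paragraph above has in mind.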
Unfortunately the speed-up procedures such as hash-coding cease to be available and we conjecture (in a sense to be made more precise in §12.7.6) that the loss is irremediable.

Other considerations we shall mention concern the operation of $A_{\text{file}}$. We note that the PERCEPTRON and the COMPLETE STORAGE procedures share the following features:

They act incrementally, that is, change the stored memory slightly as a function of the currently presented member of the data set.

They operate in “real time” without using large quantities of auxiliary scratch-pad memory.

They can accept the data set in any order and are tolerant of repetitions that cause only delay but do not change the final state.

On the other hand they differ in at least one very fundamental way:

The PERCEPTRON's $A_{\text{file}}$ is a “search procedure” based on feedback from its own results. The COMPLETE STORAGE file algorithm is passive. The advantage for the perceptron is that under some conditions it finds an economical summarizing representation. The cost is that it may need to see each data point many times.

12.1.2 Multiple Classification Procedures

It is a slight generalization of these ideas to suppose the data set divided into a number of classes $F_1, \ldots, F_k$. The algorithm $A_{\text{file}}$ is presented as before with members of the data set but also with indications of the corresponding class. It constructs a body of stored information which is handed over to $A_{\text{find}}$, whose task is to assign query points to their classes using this information.

Example: We have seen (§11.3.1) how to extend the concept of the perceptron to a multiple classification. The training algorithm, $A_{\text{file}}$, finds $k$ vectors $A_1, \ldots, A_k$, and $A_{\text{find}}$ assigns the vector $\Phi$ to $F_j$ if
$$\Phi \cdot A_j > \Phi \cdot A_i \quad (\text{all } i \ne j).$$

INNER PRODUCT

Example: The following situation will seem much more familiar to many readers.
If we think of each class $F_j$ as a “clump” or “cloud” or “cluster” of points in $\Phi$-space, then we can imagine that with each $F_j$ is associated a special point $B_j$ that is, somehow, a “typical” or “average” point. For example, $B_j$ could be the center of gravity, that is, the mean of all the vectors in $F_j$ (or, say, of just those vectors that so far have been observed to be in $F_j$). Then a familiar procedure is: $\Phi$ is judged to be in that $F_j$ for which the Euclidean distance $|\Phi - B_j|$ is the smallest. That is, each $\Phi$ is identified with the nearest B-point.

Now this nearness scheme and the inner-product scheme might look quite different, but they are essentially the same! For we have only to observe that the set of points closer to one given point $B_1$ than to another $B_2$ is bounded by a hyperplane (Figure 12.1), and hence can be defined by a linear inequality.

[Figure 12.1]

Similarly, the points closest to one of a number of $B_j$'s form a (convex) polygon (Figure 12.2) and this is true in higher dimensions, also. Formally, we see this equivalence by observing that
$$|\Phi - B_j|^2 = |\Phi|^2 - 2\,\Phi \cdot B_j + |B_j|^2.$$
Now, if we can assume that all the $\Phi$'s have the same length $L$, then the Euclidean distance $|\Phi - B_j|$ will be smallest when
$$\Phi \cdot B_j - \tfrac{1}{2}|B_j|^2 = \Phi \cdot B_j - \theta_j$$
is largest. But this is exactly the inner-product, if the “threshold” is removed by §1.2.1 (1).

To see that the inner-product concept loses nothing by requiring the $\Phi$'s to have the same length, we add an extra dimension and replace each $\Phi = [\varphi_1, \ldots, \varphi_n]$ by [formula missing in this copy] so that all have length $L^2 = n$. We have to add one dimension to the $B$'s, too, but can always set its coefficient to zero.

12.2 A Variety of Classification Algorithms

We select, from the infinite variety of schemes that one might use to divide a space into different classes, a few schemes that illustrate aspects of our main subject: computation and linear separation.
We will summarize each very briefly here; the remainder of the chapter compares and contrasts some aspects of their algorithmic structures, memory requirements, and commitments they make about the nature of the classes.

Each of our models uses the same basic form of decision algorithm for $A_{\text{find}}$. In each case there is assigned to each class $F_j$ one or more vectors $A_i$; we will represent this assignment by saying that $A_i$ is associated with $F_{j(i)}$. Given a vector $\Phi$, the decision rule is always to choose that $F_{j(i)}$ for which $A_i \cdot \Phi$ is largest. As noted in §12.1.2 this is mathematically equivalent to a rule that minimizes $|\Phi - A_i|$ or some very similar formula.

For each model we must also describe the algorithm $A_{\text{file}}$ that constructs the $A_i$'s, on the basis of prior experience, or a priori information about the classes. In the brief vignettes below, the fine details of the $A_{\text{file}}$ procedures are deferred to other sections.

12.2.1 The PERCEPTRON Procedure

There is one vector $A_j$ for each class $F_j$. $A_{\text{file}}$ can be the procedures described in §11.1 for the 2-class case and in §11.4.1 for the multiclass case.

12.2.2 The BAYES Linear Statistical Procedure

Again we have one $A_j$ for each $F_j$. $A_{\text{file}}$ is quite different, however. For each class $F_j$ and each partial predicate $\varphi_i$, define (the displayed formula is lost in this copy; the following is reconstructed from Formula 3 of §12.4.3)
$$w_{ij} = \log\frac{p_{ij}}{1 - p_{ij}}, \qquad \theta_j = \log p_j + \sum_i \log(1 - p_{ij}),$$
where $p_{ij}$ is the probability that $\varphi_i = 1$, given that $\Phi$ is in $F_j$, and $p_j$ is the probability of $F_j$ as in §12.4.3. Then define
$$A_j = (\theta_j, w_{1j}, w_{2j}, \ldots).$$
We will explain in §12.4.3 the conditions under which this use of “probability” makes sense, and describe some “learning” algorithms that could be used to estimate or approximate the $w_{ij}$'s.

The BAYES procedure has the advantage that, provided certain statistical conditions are satisfied, it gives good results for classes that are not linearly separable.
In fact it gives the lowest possible error rate for procedures in which $A_{\text{file}}$ depends only on conditional probabilities, given that the $\varphi$'s are statistically independent in the sense explained in §12.4.2. It is astounding that this is achieved by a linear formula.

12.2.3 The BEST PLANES Procedure

In different situations either PERCEPTRON or BAYES may be superior. But often, when the $F_j$'s are not linearly separable, there will exist a set of $A_i$ vectors which will give fewer errors than either of these schemes. So define the BEST PLANES procedure to use that set of $A_i$'s for which choice of the largest $A_i \cdot \Phi$ gives the fewest errors.

By definition, BEST PLANES is always at least as good as BAYES or PERCEPTRON. This does not contradict the optimality of BAYES since the search for the best plane uses information other than the conditional probabilities. Unfortunately no practically efficient $A_{\text{file}}$ is known for discovering its $A_i$'s. As noted in §12.3, hill-climbing will apparently not work well because of the local peak problem.

12.2.4 The ISODATA Procedure

In the schemes described up to now, we assigned exactly one A-vector to each F-class. If we shift toward the minimum-distance viewpoint, this suggests that such procedures will work satisfactorily only when the F-classes are “localized” into relatively isolated, single regions: one might think of clumps, clusters, or clouds. Given this intuitive picture, one naturally asks what to do if an F-class, while not a neat spherical cluster, is nevertheless semilocalized as a small number of clusters or, perhaps, a snakelike structure. We could still handle such situations, using the least-distance $A_{\text{find}}$, by assigning an A-vector to each subcluster of each $F$, and using several $A$'s to outline the spine of the snake. To realize this concept, we need an $A_{\text{file}}$ scheme that has some ability to analyze distributions into clusters. We will describe one such scheme, called ISODATA, in §12.5.
12.2.5 The NEAREST NEIGHBOR Procedure

Our simplest and most radical scheme assumes no limit on the number of A-vectors. $A_{\text{file}}$ stores in memory every $\Phi$ that has ever been encountered, together with the name of its associated F-class. Given a query vector $\Phi_0$, we find which $\Phi$ in the memory is closest to $\Phi_0$, and choose the F-class associated with that $\Phi$.

This is a very generally powerful method: it is very efficient on many sorts of cluster configurations; it never makes a mistake on an already seen point; in the limit it approaches zero error except under rather peculiar circumstances (one of which is discussed in the following section).

NEAREST NEIGHBOR has an obvious disadvantage, the very large memory required, and a subtle disadvantage: there is reason to suspect that it entails large, and fundamentally unavoidable, computation costs (discussed in §12.6).

12.3 Heuristic Geometry of Linear Separation Methods

The diagrams of this section attempt to suggest some of the behavioral aspects of the methods described in §12.4. To compensate for our inability to depict multidimensional configurations, we use two-dimensional multivalued coordinates. The diagrams may appear plausible, but they are really defective images that do not hint at the horrible things that can happen in a space of many dimensions. Using this metaphorical kind of picture, we can suggest two kinds of situations which tend to favor one or the other of BAYES or PERCEPTRON (see Figure 12.3).

The BAYES line in Figure 12.3 tends to lie perpendicular to the line between the “mean” points of $F_-$ and $F_+$. Hence in Figure 12.3(a), we find that BAYES makes some errors. The sets are, in fact, linearly separable, hence PERCEPTRON, eventually, makes no errors at all. In Figure 12.3(b) we find that BAYES makes a few errors, just as in 12.3(a).
We don’t know much about PERCEPTRON in nonseparable situations; it is clear that in some situations it will not do as well as BAYES. By definition BEST PLANE, of course, does at least as well as either BAYES or PERCEPTRON.

From the start, the very suggestion that any of these procedures will be any good at all amounts to an a priori proposal that the F-classes can be fitted into simple clouds of some kind, perhaps with a little overlapping, as in Figure 12.4. Such an assumption could be justified by some reason for believing that the differences between $F_+$ and $F_-$ are due to some one major influence plus a variety of smaller, secondary effects. In general PERCEPTRON tends to be sensitive to the outer boundaries of the clouds, and relatively insensitive to the density distributions inside, while BAYES weights all $\Phi$'s equally. In cases that do not satisfy either the single-cloud or the slight-overlap condition (Figure 12.5), we can expect BAYES to do badly, and presumably PERCEPTRON also.

[Figure 12.5]

BEST PLANE can be substantially better because it is not subject to the bad influence of symmetry. But finding the best plane is likely to involve bad computation problems because of multiple, locally optimal “hills.” Figure 12.6 shows some of the local peaks for BEST PLANE in the case of a bad “paritylike” situation. Here, even ISODATA will do badly unless it is allowed to have one A-vector for nearly every clump. But in the case of a moderate number of clumps, with an $A_i$ in each, ISODATA should do quite well. (See §12.5.) Generally, we would expect PERCEPTRON to be slightly better than BAYES because it exploits behavioral feedback, worse because of undue sensitivity to isolated errors.

[Figure 12.6]

One would expect the NEAREST NEIGHBOR procedure to do well under a very wide range of conditions.
Indeed, NEAREST NEIGHBOR, in the limiting case of recording all $\Phi$'s with their class names, will do at least as well as any other procedure. There are conditions, though, in which NEAREST NEIGHBOR does not do so well until the sample size is nearly the whole space. Consider, for example, a space in which there are two regions:

[Figure: the two regions and their class probabilities]

In the upper region a fraction $p$ of the points are in $F_+$, and these are randomly distributed in space, and similarly for $F_-$ in the lower region. Then if a small fraction of the points are already recorded, the probability that a randomly selected point has the same $F$ as its nearest recorded neighbor is
$$p^2 + q^2 = 1 - 2pq,$$
while the probability of correct identification by BAYES or by BEST PLANE is simply $p$. Assuming that $p > \tfrac{1}{2}$ (if not, just exchange $p$ and $q$) we see that
$$\text{Error}_{\text{BEST PLANE}} < \text{Error}_{\text{NEAREST NEIGHBOR}} < 2 \times \text{Error}_{\text{BEST PLANE}},$$
so that NEAREST NEIGHBOR is worse than BEST PLANE, but not arbitrarily worse. This effect will remain until so many points have been sampled that there is a substantial chance that the sampled point has been sampled before, that is, until a good fraction of the whole space is covered.

On the other side, to the extent that the “mixing” of $F_+$ and $F_-$ is less severe (see Figure 12.7), the NEAREST NEIGHBOR will converge to very good scores as soon as there is a substantial chance of finding one sampled point in most of the microclumps.

[Figure 12.7]

A very bad case is a paritylike structure in which NEAREST NEIGHBOR actually does worse than chance. Suppose that $\Phi \in F_1$ if and only if $\varphi_i = 1$ for an even number of $i$'s. Then, if there are $n$ $\varphi$'s, each $\Phi$ will have exactly $n$ neighbors whose distance $d$ satisfies $0 < d \le 1$. Suppose that all but a fraction $q$ of all possible $\Phi$'s have already been seen.
Then NEAREST NEIGHBOR will err on a given $\Phi$ if it has not been seen (probability $= q$) but one of its immediate neighbors has been seen (probability $= 1 - q^n$). So the probability of error is $\ge q(1 - q^n)$, which, for large $n$, can be quite near certainty.

This example is, of course, “pathological,” as mathematicians like to say, and NEAREST NEIGHBOR is probably good in many real situations. Its performance depends, of course, on the precise “metric” used to compute distance, and much of classical statistical technique is concerned with optimizing coordinate axes and measurement scales for applications of NEAREST NEIGHBOR.

Finally, we remark that because the memory and computation costs for this procedure are so high, it is subject to competition from more elaborate schemes outside the regime of linear separation, and hence outside the scope of this book.

12.4 Decisions Based on Probabilities of Predicate-Values

Some of the procedures discussed in previous sections might be called “statistical” in the weak sense that their success is not guaranteed except up to some probability. The procedures discussed in this section are statistical also in the firmer sense that they do not store members of the data set directly, but instead store statistical parameters, or measurements, about the data set. We shall analyze in detail a system that computes, or estimates, the conditional probabilities $p_{ij}$ that, for each class $F_j$, the predicate $\varphi_i$ has the value 1. It stores these $p_{ij}$'s together with the absolute probabilities $p_j$ of $\Phi$ being in each of the $F_j$'s.

Given an observed $\Phi$, the decision to choose an $F_j$ is a typical statistical problem usually solved by a “maximum likelihood” or Bayes-rule method. It is interesting that procedures of this kind resemble very closely the perceptron separation methods.
In fact, when we can assume that the conditional probabilities $p_{ij}$ are suitably independent (§12.4.2) it turns out that the best procedure is the linear threshold decision we called BAYES in §12.2.2. We now show how this comes about.

12.4.1 Maximum Likelihood and Bayes Law

In Chapter 11 we assumed that each $\Phi$ is associated with a unique $F_j$. We now consider the slightly more general case in which the same $\Phi$ could be produced by events in several different F-classes. Then, given an observed $\Phi$, we cannot in general be sure which $F_j$ is responsible, but we can at best know the associated probabilities.

Suppose that a particular $\Phi_0$ has occurred and we want to know which $F$ is most likely. Now if $F_j$ is responsible, then the “joint event” $F_j \wedge \Phi_0$ has occurred; this has (by definition) probability $P(F_j \wedge \Phi_0)$. Now (by definition of conditional probability) we can write
$$P(F_j \wedge \Phi_0) = P(F_j) \cdot P(\Phi_0 \mid F_j). \qquad (1)$$
That is, the probability that both $F_j$ and $\Phi_0$ will happen together is equal to the probability that $F_j$ will occur multiplied by the probability that if $F_j$ occurs so will $\Phi_0$.

We should choose that $F_j$ which gives Formula 1 the largest value, because that choice corresponds to the most likely of those events that could have occurred:
$$F_1 \wedge \Phi_0, \quad F_2 \wedge \Phi_0, \quad \ldots, \quad F_k \wedge \Phi_0.$$

There are serious practical obstacles to the direct use of Formula 1. If there are many different $\Phi$'s it becomes impractical to store all the decisions in memory, let alone to estimate them all on the basis of empirical observation. Nor has the system any ability to guess about $\Phi$'s it has not seen before. We can escape all these difficulties by making one critical assumption (in effect, assuming the situation closely fits a certain model): that the partial predicates of $\Phi = (\varphi_1, \ldots, \varphi_m)$ are suitably independent.

12.4.2 Independence

Up to now we have suppressed the $X$'s of earlier chapters because we did not care where the values of the $\varphi$'s came from.
We bring them back for a moment so that we can give a natural context to the independence hypothesis.

We can evade the problems mentioned above if we can assume that the tests $\varphi_i(X)$ are statistically independent over each F-class. Precisely, this means that for any $\Phi(X) = (\varphi_1(X), \ldots, \varphi_m(X))$ we can assert that, for each $j$,
$$P(\Phi \mid F_j) = P(\varphi_1 \mid F_j) \times \cdots \times P(\varphi_m \mid F_j). \qquad (2)$$
We emphasize that this is a most stringent condition. For example, it is equivalent to the assertion that:

Given that a $\Phi$ is in a certain F-class, if one is told also the values of some of the $\varphi$'s, this gives absolutely no further information about the values of the remaining $\varphi$'s.

Experimentally, one would expect to find independence when the variations in the values of the $\varphi$'s are due to “noise” or measurement uncertainties within the individual $\varphi$-mechanisms. For, to the extent that these have separate causes, one would not expect the values of one to help predict the values of another. But if the variations in the $\varphi$'s are due to selection of different $X$'s from the same F-class, one would not ordinarily assume independence, since the value of each $\varphi$ tells us something about which $X$ in $F$ has occurred, and hence should help at least partly to predict the values of other $\varphi$'s.

[Figure 12.9]

An extreme example of nonindependence is the following: there are two classes, $F_1$ and $F_2$, and two $\varphi$'s, defined by

$\varphi_1(X)$: a pure random variable with $P(\varphi_1(X) = 1) = \tfrac{1}{2}$. Its value is determined by tossing a coin, not by $X$.

$\varphi_2(X) = \varphi_1(X)$ if $X \in F_1$, and $\varphi_2(X) = 1 - \varphi_1(X)$ if $X \in F_2$.

Then $P(\varphi_1 \wedge \varphi_2 \mid F_1) = \tfrac{1}{2}$. But $P(\varphi_1 \mid F_1) \cdot P(\varphi_2 \mid F_1) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}$.

Notice that neither $\varphi_1$ nor $\varphi_2$ taken alone gives any information whatever about $F$! Each appears to be a random coin toss. But from both one can determine perfectly which $F$ has occurred, for $\varphi_1 = \varphi_2$ implies $F_1$, while $\varphi_1 \ne \varphi_2$ implies $F_2$ with absolute certainty.
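The coin-toss example can be tabulated exactly. The sketch below is only an illustration (the function names and the use of exact fractions are choices made here): for each class it lists the true joint probability of every pair $(\varphi_1, \varphi_2)$ next to the value that Formula 2 would assign if the two predicates were independent within the class.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)

def joint(cls):
    """P(phi1, phi2 | class): phi1 is a fair coin; phi2 copies phi1
    in class 1 and is its complement in class 2."""
    table = {}
    for phi1 in (0, 1):
        phi2 = phi1 if cls == 1 else 1 - phi1
        table[(phi1, phi2)] = HALF
    return table

def marginal(table, index, value):
    """Probability that predicate number `index` takes `value`."""
    return sum(p for outcome, p in table.items() if outcome[index] == value)

for cls in (1, 2):
    table = joint(cls)
    for phi1, phi2 in product((0, 1), repeat=2):
        actual = table.get((phi1, phi2), Fraction(0))
        if_independent = marginal(table, 0, phi1) * marginal(table, 1, phi2)
        print(f"class {cls}, phi=({phi1},{phi2}): "
              f"actual {actual}, product of marginals {if_independent}")
```

Every marginal is one half in both classes, so each predicate alone looks like a coin toss, yet the actual joint probabilities are one half or zero rather than one quarter.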
remark: We assume only independence within each class $F_j$. So if $X$ is not given, then learning one $\varphi$-value can help predict another. For example, suppose that
$$\varphi_1 = \varphi_2 = 0 \text{ if } X \in F_1, \qquad \varphi_1 = \varphi_2 = 1 \text{ if } X \in F_2.$$
These two $\varphi$'s are in fact independent on each $F$. But if we did not know in advance that $X \in F_1$ but were told that $\varphi_1 = 0$, we could indeed then predict that $\varphi_2 = 0$ also, without this violating our independence assumption. (If we had previously been told that $X \in F_1$, then we could already predict the value of $\varphi_2$; in that case learning the value of $\varphi_1$ would have no effect on our prediction of $\varphi_2$.)

12.4.3 The Maximum Likelihood Decision, for Independent $\varphi$'s, Is a Linear Threshold Predicate!

Assume that the $\varphi_i$'s are statistically independent for each $F_j$. Define
$$p_j = P(F_j), \qquad p_{ij} = P(\varphi_i = 1 \mid F_j), \qquad q_{ij} = 1 - p_{ij} = P(\varphi_i = 0 \mid F_j).$$
Suppose that we have just observed a $\Phi = (\varphi_1, \ldots, \varphi_m)$, and we want to know which $F_j$ was most probably responsible for this. Then, according to Formulas 1 and 2, we will choose that $j$ which maximizes
$$p_j \cdot \prod_{\varphi_i = 1} p_{ij} \cdot \prod_{\varphi_i = 0} q_{ij} = p_j \prod_i p_{ij}^{\varphi_i}\, q_{ij}^{(1 - \varphi_i)} = p_j \cdot \prod_i \left(\frac{p_{ij}}{q_{ij}}\right)^{\varphi_i} \cdot \prod_i q_{ij}.$$
Because sums are more familiar to us than products, we will replace these by their logarithms. Since $\log x$ increases when $x$ does, we still will select the largest of
$$\sum_i \varphi_i \cdot \log\frac{p_{ij}}{q_{ij}} + \Bigl(\log p_j + \sum_i \log q_{ij}\Bigr). \qquad (3)$$
Because the right-hand expression is a constant that depends only upon $j$, and not upon the experimental $\Phi$, all of Formula 3 can be written simply as
$$\sum_i w_{ij}\varphi_i + \theta_j. \qquad (3')$$

Example 1: In the case that there are just two classes, $F_1$ and $F_2$, we can decide that $X \in F_1$ whenever
$$\sum_i w_{i1}\varphi_i + \theta_1 > \sum_i w_{i2}\varphi_i + \theta_2,$$
that is, when
$$\sum_i (w_{i1} - w_{i2})\varphi_i > (\theta_2 - \theta_1), \qquad (4)$$
which has the form of a linear threshold predicate
$$\psi = \Bigl\lceil \sum_i \alpha_i \varphi_i > \theta \Bigr\rceil.$$
Thus we have the remarkable result that the hypothesis of independence among the $\varphi$'s leads directly to the familiar linear decision policy.
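The step from the product of Formulas 1 and 2 to the linear form (3') can be checked on a small model. The following is a sketch under stated assumptions: the priors and conditional probabilities are hypothetical numbers chosen for illustration, and the function names are invented here. It compares the class chosen by maximizing the product with the class chosen by maximizing $\sum_i w_{ij}\varphi_i + \theta_j$ for every possible $\Phi$.

```python
import math
from itertools import product

def linear_form(p_class, p_cond):
    """Return (theta, w) with w[i] = log(p_i / q_i) and
    theta = log p_class + sum_i log q_i, following Formula 3."""
    w = [math.log(p / (1 - p)) for p in p_cond]
    theta = math.log(p_class) + sum(math.log(1 - p) for p in p_cond)
    return theta, w

def joint_probability(p_class, p_cond, phi):
    """Formulas 1 and 2: P(F_j) times the per-predicate factors."""
    result = p_class
    for p, value in zip(p_cond, phi):
        result *= p if value == 1 else 1 - p
    return result

# Hypothetical two-class model with three independent predicates.
priors = [0.3, 0.7]
conditionals = [[0.9, 0.2, 0.6],   # p_i1 for class F_1
                [0.4, 0.7, 0.5]]   # p_i2 for class F_2
forms = [linear_form(pj, pc) for pj, pc in zip(priors, conditionals)]

agreements = []
for phi in product((0, 1), repeat=3):
    by_product = max(range(2), key=lambda j: joint_probability(
        priors[j], conditionals[j], phi))
    by_sum = max(range(2), key=lambda j: forms[j][0] + sum(
        w * x for w, x in zip(forms[j][1], phi)))
    agreements.append(by_product == by_sum)

print("linear rule agrees with product rule on every phi:", all(agreements))
```

Because the linear score is just the logarithm of the product, the two rules can differ only through rounding at exact ties, which this particular model does not contain.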
Example 2 (probabilities of error): Suppose that for all $i$, $p_{i1} = q_{i2}$. Then $p_{i1}$ is the probability that $\varphi_i(X) = \psi(X)$ and $q_{i1}$ is the probability that $\varphi_i(X) \ne \psi(X)$, that is, that $\varphi_i$ makes an error in (individually) predicting the value of $\psi = \lceil X \in F_1 \rceil$. Then inequality 4 becomes
$$\sum_i w_{i1}(2\varphi_i - 1) > \log\frac{p_2}{p_1}. \qquad (4')$$
Now observe that the $(2\varphi_i - 1)$ term has the effect of adding or subtracting $w_{i1}$ according to whether $\varphi_i = 1$ or $0$. Thus, we can think of the $w$'s as weights to be added (according to the $\varphi$'s) to one side or the other of a balance:

[Figure: a balance]

The $\log(p_2/p_1)$ is the “a priori weight” in favor of $F_2$ at the start, and each $w_{i1} = \log(p_{i1}/q_{i1})$ is the “weight of the evidence” that $\varphi_i = 1$ gives in favor of $F_1$.

It is quite remarkable that the optimal separation algorithm, given that the $\varphi$-probabilities are independent, has the form (Inequality 4) of a linear threshold predicate. But one must be sure to understand that if $\lceil \sum \alpha_i \varphi_i > \theta \rceil$ is the “optimal” predicate obtained by the independent-probability method, yet does not perfectly realize a predicate $\psi$, this does not preclude the existence of a precise separation $\lceil \sum \alpha_i' \varphi_i > \theta' \rceil$ which always agrees with $\psi$. [This is the situation suggested by Figure 12.3(a).] For Inequality 4 is “optimal” only in relation to all $A_{\text{file}}$ procedures that use no information other than the conditional probabilities $\{p_j\}$ and $\{p_{ij}\}$, while a perceptron computes coefficients by a nonstatistical search procedure that is sensitive to individual events. Thus, if $\psi$ is in fact in $L(\Phi)$ the perceptron will eventually perform at least as well as any linear-statistical machine. The latter family can have the advantage in some cases:

1. If $\psi \notin L(\Phi)$ the statistical scheme may produce a good approximate separation while the perceptron might fluctuate wildly.

2. The time to achieve a useful level may be long for the perceptron file algorithm, which is basically a serial search procedure.
The linear-statistical machine is basically more parallel, because it finds each coefficient independently of the others, and needs only a fair sample of the $F$'s. (While superficially perceptron coefficients appear to be changed individually, each decision about a change depends upon a test involving all the coefficients.)

12.4.4 Layer-Machines

Formula 3' suggests the design of a machine for making our decision: [figure]

$D$ is a device that simply decides which of its inputs is the largest. Each $\varphi$-device emits a standard-sized pulse [if $\varphi(X) = 1$] when $X$ is presented. The pulses are multiplied by the $w_{ij}$ quantities as indicated, and summed at the $\Sigma$-boxes. The $\theta_j$ terms may be regarded as corrections for the extent to which the $p_{ij}$'s deviate from a central value of $\tfrac{1}{2}$, combined with the a priori bias concerning $F_j$ itself.

It is often desirable to minimize the costs of errors, rather than simply the chance of error. If we define $C_{jk}$ to be the cost of guessing $F_k$ when it is really $F_j$ that has occurred, then it is easy to show that Formulas 1 and 2 now lead to finding the $k$ that minimizes

$$\sum_j C_{jk}\, P_j\, B_j \prod_i \left(\frac{p_{ij}}{q_{ij}}\right)^{\varphi_i}, \qquad \text{where } B_j = \prod_i q_{ij}.$$

It is interesting that this more complicated procedure also lends itself to the multilayer structure: [figure]

It ought to be possible to devise a training algorithm to optimize the weights in this using, say, the magnitude of a reinforcement signal to communicate to the net the cost of an error. We have not investigated this.

12.4.5 Probability-estimation Procedures

The $A_{\text{file}}$ algorithm for the “Bayes linear-statistical” procedure has to compute, or estimate, either the probabilities $p_{ij}$ and $P_j$ of Formula 3 or other statistical quantities such as “weight of evidence” ratios $p/(1 - p)$. Normally these cannot be calculated directly (because they are, by definition, limits) so one must find estimators.
The simplest way to estimate a probability is to find the ratio $H/N$ of the number $H$ of “favorable” events to the number $N$ of all events in a sequence. If $\varphi^{[t]}$ is the value of $\varphi$ on the $t$th trial, then an estimate of $P(\varphi = 1)$ after $n$ trials can be found by the procedure:

START: Set $\alpha$ to 0. Set $n$ to 1.
REPEAT: Set $\alpha$ to $\dfrac{(n - 1)\alpha + \varphi^{[n]}}{n}$. Set $n$ to $n + 1$. Go to REPEAT.

which can easily be seen to recompute the “score” $H/N$ after each event. This procedure has the disadvantage that it has to keep a record of $n$, the number of trials. Since $n$ increases beyond bound, this would require unlimited memory. To avoid this, we rewrite the above program's computation in the form

$$\alpha^{[n]} = \Bigl(1 - \frac{1}{n}\Bigr)\alpha^{[n-1]} + \frac{1}{n}\,\varphi^{[n]}.$$

This suggests a simpler heuristic substitute: define

$$\alpha^{[0]} = 0, \qquad \alpha^{[n]} = (1 - \epsilon)\,\alpha^{[n-1]} + \epsilon\,\varphi^{[n]}, \tag{5}$$

where $\epsilon$ is a “small” number between 0 and 1. It is easy to show that as $n$ increases the expected or mean value of $\alpha^{[n]}$, which we will write as $\langle \alpha^{[n]} \rangle$, approaches $p$ (that is, $\langle \varphi \rangle$) as a limit. For

$$\langle \alpha^{[1]} \rangle = (1 - \epsilon)\langle \alpha^{[0]} \rangle + \epsilon p = \epsilon p = [1 - (1 - \epsilon)]\,p,$$

and

$$\langle \alpha^{[2]} \rangle = (1 - \epsilon)(1 - (1 - \epsilon))\,p + \epsilon p = (1 - (1 - \epsilon)^2)\,p,$$

and one can verify that, for all $n$,

$$\langle \alpha^{[n]} \rangle = (1 - (1 - \epsilon)^n)\,p \;\to\; p \quad (\text{as } n \to \infty).$$

Thus, Process 5 gives an estimation of the probability that $\varphi = 1$. A more detailed analysis would show how the estimate depends on the events of the recent past, with the effect of ancient events decaying exponentially, with coefficients like $(1 - \epsilon)^{(t_0 - t)}$. Because Process 5 “forgets,” it certainly does not make “optimal” use of its past experience; but under some circumstances it will be able to “adapt” to changing environmental statistics, which could be a good thing. As a direct consequence of the decay, our estimator has a peculiar property: its variance “$\sigma^2$” does not approach zero. In fact, one can show that, for Process 5,

$$\sigma^2 \;\to\; \frac{\epsilon}{2 - \epsilon}\; p\,(1 - p),$$

and this, while not zero, will be very small whenever $\epsilon$ is.
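The two claims about Process 5 can be checked with exact rational arithmetic. The sketch below takes expectations on both sides of the update rule to get a recursion for the mean, and uses the corresponding recursion for the variance under the assumption that successive trials are independent with $P(\varphi = 1) = p$. The function names and the particular values of $\epsilon$ and $p$ are hypothetical, picked only to illustrate the formulas.

```python
from fractions import Fraction

def mean_after(n, eps, p):
    """Exact mean of alpha^[n] under Process 5, starting from alpha^[0] = 0."""
    mean = Fraction(0)
    for _ in range(n):
        # Expectation of (1 - eps) * alpha + eps * phi, with <phi> = p.
        mean = (1 - eps) * mean + eps * p
    return mean

def variance_after(n, eps, p):
    """Exact variance of alpha^[n], assuming independent trials with P(phi = 1) = p."""
    var = Fraction(0)
    for _ in range(n):
        # The new trial is independent of the old estimate, so variances add.
        var = (1 - eps) ** 2 * var + eps ** 2 * p * (1 - p)
    return var

# Hypothetical values for illustration.
EPS, P = Fraction(1, 10), Fraction(3, 10)
closed_form_mean = (1 - (1 - EPS) ** 20) * P          # (1 - (1 - eps)^n) p with n = 20
limit_variance = EPS / (2 - EPS) * P * (1 - P)        # eps p (1 - p) / (2 - eps)
```

Here `mean_after(20, EPS, P)` coincides exactly with `closed_form_mean`, and `variance_after(n, EPS, P)` climbs toward `limit_variance` from below as `n` grows, without ever reaching zero.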
The situation is thus quite different from the $H/N$ estimate, whose variance is $p(1 - p)/n$ and approaches zero as $n$ grows. In fact, we can use the variance to compare the two procedures: If we “equate” the variances

$$p(1 - p) \cdot \frac{\epsilon}{2 - \epsilon} \;\approx\; p(1 - p) \cdot \frac{1}{n}$$

we obtain

$$n \approx \frac{2}{\epsilon},$$

suggesting that the reliability of the estimate of $p$ given by Process 5 is about the same as we would get by simple averaging of the last $2/\epsilon$ samples; thus one can think of the number $1/\epsilon$ as corresponding to a “time-constant” for forgetting.

Convergence to the Fixed-Point

Another estimation procedure one might consider is:

START: Set $\alpha$ to anything.
REPEAT: If $\varphi = 1$, set $\alpha$ to $\alpha + 1$. If $\varphi = 0$, set $\alpha$ to $(1 - \epsilon)\alpha$. Go to REPEAT.

or, equivalently, one could write

$$\alpha^{[n]} = (1 - \epsilon)\,\alpha^{[n-1]} + (1 + \epsilon\,\alpha^{[n-1]})\,\varphi^{[n]}.$$

It can be shown that this has an expected value, in the limit, of

$$\frac{1}{\epsilon} \cdot \frac{p}{1 - p}.$$

(At the fixed point the mean $\langle\alpha\rangle$ satisfies $\langle\alpha\rangle = p(\langle\alpha\rangle + 1) + (1 - p)(1 - \epsilon)\langle\alpha\rangle$.) It is interesting that a direct estimate of the likelihood ratio is obtained by such a simple process as

if $\varphi = 1$ add 1, otherwise multiply by $(1 - \epsilon)$.

The variance, in case anyone cares, is

$$\frac{p}{(1 - p)^2} \cdot \frac{1}{1 - (1 - \epsilon)^2}.$$

12.4.6 The Samuel Compromise

In his classical paper about “Some Studies in Machine Learning using the Game of Checkers,” Arthur L. Samuel uses an ingenious combination of probability estimation methods. In his application it occasionally happens that a new evidence term $\varphi_i$ is introduced (and an old one is abandoned because it has not been of much value in the decision process). When this happens there is a problem of preventing violent fluctuations, because after one or a few trials the term's probability estimate will have a large variance as compared with older terms that have better statistical records. Samuel uses the following algorithm to “stabilize” his system: he sets $\alpha^{[0]}$ to $\tfrac{1}{2}$ and uses

$$\alpha^{[n+1]} = \Bigl(1 - \frac{1}{N}\Bigr)\alpha^{[n]} + \frac{1}{N}\,\varphi^{[n+1]},$$

where $N$ is set according to the “schedule”:

$$N = \begin{cases} 16 & \text{if } n < 32, \\ 2^m & \text{if } 2^m \le n < 2^{m+1} \text{ and } 32 \le n < 256, \\ 256 & \text{if } 256 \le n. \end{cases}$$
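The schedule can be written out as a small function. This is only a sketch under one reading of the inequalities (the boundaries are taken as $2^m \le n < 2^{m+1}$, so that every $n$ falls in exactly one case); the names `samuel_n` and `samuel_update` are hypothetical and are not taken from Samuel's paper.

```python
def samuel_n(n):
    """Averaging length N for trial count n, under the reading stated above."""
    if n < 32:
        return 16
    if n >= 256:
        return 256
    # For 32 <= n < 256: the largest power of two not exceeding n.
    return 1 << (n.bit_length() - 1)

def samuel_update(alpha, phi, n):
    """One step alpha -> (1 - 1/N) * alpha + (1/N) * phi, with N = samuel_n(n)."""
    big_n = samuel_n(n)
    return (1 - 1 / big_n) * alpha + phi / big_n
```

With this reading, $N$ stays at 16 for the first 32 trials, then doubles each time $n$ doubles, and is frozen at 256 from $n = 256$ onward.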
Thus, in the beginning the estimate is made as though the probability had already been estimated to be $\tfrac{1}{2}$ on the basis of several, that is, the order of 16, trials. Then in the “middle” period, the algorithm approximates the uniform weighting procedure. Finally (when $n \ge 256$) the procedure changes to the exponential decay mode, with fixed $N$, so that recent experience can outweigh earlier results. (The use of the powers of two represents a convenient computer-program technique for doing this.) In Samuel's system, the terms actually used have the form we found in Inequality 4' of §12.4.3,

$$2\varphi^{[t]} - 1,$$

so that the “estimator” ranges in the interval

$$-1 \le \alpha^{[t]} \le +1$$

and can be treated as a “correlation coefficient.”

12.4.7 A Simple “Synaptic” Reinforcer Theory

Let us make a simple “neuronal model.” The model is to estimate $p_{ij} = P(\varphi_i \mid F_j)$, using only information about occurrences of $[\varphi_i = 1]$ and of $[\Phi \in F_j]$. Our model will have the following “anatomy”: [figure]

The bag $B_i$ contains a very high and constant concentration of a substance $E$. When $\varphi_i$ or $F_j$ occur (or “fire”) the walls of the corresponding bags $B_i$ and/or $C_j$ become “permeable” to $E$ for a moment. If $\varphi_i$ alone occurs, nothing really changes, because $B_i$ is surrounded by the impermeable $C_j$. If $F_j$ alone occurs, $C_j$ loses some $E$ by diffusion to the outside: in fact, if $\alpha$ is the amount of $E$ in $C_j$ it may be assumed (by the usual laws of diffusion and concentration) to lose some fraction $\epsilon$ of $\alpha$:

$$\alpha' = (1 - \epsilon)\,\alpha \quad \text{if } F_j \text{ occurs and } \varphi_i = 0.$$

If both $\varphi_i$ and $F_j$ occur then approximately the same loss will occur from $C_j$. Simultaneously, an essentially constant amount $b$ will be “injected” by diffusion from $B_i$ to $C_j$. So

$$\alpha' = (1 - \epsilon)\,\alpha + b \quad \text{if } F_j \text{ occurs and } \varphi_i = 1.$$

(We can assume that $b$ is constant because the concentration of $E$ is very high in $B_i$ compared to that in $C_j$. One can invent any number of similar variations.)
In either case we get

$$\alpha' = (1 - \epsilon)\,\alpha + \varphi_i\, b,$$

so that in the limit the mean of $\alpha$ approaches $\dfrac{b}{\epsilon}\, p_{ij}$ (as can be seen from the analysis in §12.4.5). This is proportional to, and hence an estimator of, $p_{ij} = P(\varphi_i \mid F_j)$.

Thus the simple anatomy, combined with the membrane becoming permeable briefly following a nerve impulse, could give a quantity that is an estimator of the appropriate probability. How could this representation of probability be translated into a useful neuronal mechanism? One could imagine all sorts of schemes: ionic concentrations (or rather, their logarithms!) could become membrane potentials, or conductivities, or even probabilities of occurrences of other chemical events. The “anatomy” and “physiology” of our model could easily be modified to obtain likelihood ratios. Indeed, it is so easy to imagine variants (the idea is so insensitive to details) that we don't propose it seriously, except as a family of simple yet intriguing models that a neural theorist should know about.

12.5 $A_{\text{file}}$ Algorithms for the ISODATA Procedure

In this section we describe a procedure proposed by G. Ball and D. Hall to delineate “clusters” in an inhomogeneous distribution of vectors. The idea is best shown by a pictorial example: imagine a two-dimensional set of points $\{\Phi\}$ that fall into obvious clusters, like [figure]

Begin by placing a few “cluster-points” $A_i^{[1]}$ into the space at some arbitrary locations, say, near the center.
We then divide the set of $\Phi$'s into subsets $R_i^{[1]}$, assigning each $\Phi$ to the nearest $A_i^{[1]}$ point: [figure]

Next, we replace each $A_i^{[1]}$ by a new cluster-point $A_i^{[2]}$ which is the mean or center-of-gravity of the $\Phi$'s in $R_i^{[1]}$, and then define $R_i^{[2]}$ to be the set of $\Phi$'s nearest to $A_i^{[2]}$: [figure]

From now on, there is little or no change; the cluster-points have “found the clusters.”

Ball and Hall give a number of heuristic refinements for creating and destroying additional cluster points; for example, to add one if the variance of an $R$-set is “too large” and to remove one if two are “too close” together. Of course, one can usually “see” two-dimensional clusters by inspection, but ISODATA is alleged to give useful results in $n$-dimensional problems where “inspection” is out of the question.

To use this procedure, in our context, we need some way to combine its automatic classification ability with the information about the $F$-classes. An obvious first step would be to apply it separately to each $F$-class, and assign all the resulting $A$'s to that class. We do not know much about more sophisticated schemes that might lead to better results in the $A_{\text{find}}$ stage.

12.5.1 An ISODATA Convergence Theorem

There is a theorem about ISODATA (told to us by T. Cover) that suggests that it leads to some sort of local minimum. Let us formalize the procedure by defining

$A^{[n]}(\Phi)$ = the $A_i^{[n]}$ that is nearest to $\Phi$. (If there are several equidistant nearest $A_i$'s, choose the one with the smallest index.)

$R_i^{[n]}$ = the set of $\Phi$'s for which $A^{[n]}(\Phi) = A_i^{[n]}$.

$A_i^{[n+1]} = \text{mean}\langle R_i^{[n]} \rangle$.

Finally define a “score”:

$$S^{[n]} = \sum_{\text{all } \Phi} \bigl|\Phi - A^{[n]}(\Phi)\bigr|^2.$$

Theorem: $S^{[1]} \ge S^{[2]} \ge \cdots \ge S^{[n]} \ge \cdots$ until there is no further change, that is, until $A_i^{[n]} = A_i^{[n+1]}$ for all $i$.

Proof:

$$S^{[n]} = \sum_i \sum_{\Phi \in R_i^{[n]}} \bigl|\Phi - A_i^{[n]}\bigr|^2 \;\ge\; \sum_i \sum_{\Phi \in R_i^{[n]}} \bigl|\Phi - A_i^{[n+1]}\bigr|^2$$

because the mean ($A_i^{[n+1]}$) of any set of vectors ($R_i^{[n]}$) is just the point that minimizes the squared-distance sum to all the points (of $R_i^{[n]}$).
And this is, in turn,

$$\ge\; \sum_i \sum_{\Phi \in R_i^{[n+1]}} \bigl|\Phi - A_i^{[n+1]}\bigr|^2 = S^{[n+1]}$$

because each point will be transferred to an $R_i^{[n+1]}$ for which the distance is minimal, that is, for all $j$,

$$\bigl|\Phi - A_j^{[n+1]}\bigr| \;\ge\; \bigl|\Phi - A^{[n+1]}(\Phi)\bigr|.$$

Corollary: The sequence of decreasing positive numbers, $S^{[n]}$, approaches a limit. If there is only a finite set of $\Phi$'s, the $A$'s must stop changing in a finite number of steps. For in the finite case there are only a finite number of partitions $\{R_i\}$ possible.

12.5.2 Incremental Methods

In analogy to the “reinforcement” methods in §12.4.5 we can approximate ISODATA by the following program:

START: Choose a set of starting points $A_i$.
REPEAT: Choose a $\Phi$. Find $A(\Phi)$, the $A_i$ nearest to $\Phi$. Replace $A(\Phi)$ by $(1 - \epsilon)A(\Phi) + \epsilon \cdot \Phi$. Go to REPEAT.

It is clear that this program will lead to qualitatively the same sort of behavior; the $A$'s will tend toward the means of their $R$-regions. But, just as in §12.4, the process will retain a permanent sampling-and-forgetting variance, with similar advantages and disadvantages. In fact, all the $A_{\text{file}}$ algorithms we have seen can be so approximated: there always seems to be a range from very local, incremental methods, to more accurate, somewhat less “adaptive” global schemes. We resume this discussion in §12.8.

12.6 Time vs. Memory for Exact Matching

Suppose that we are given a body of information (we will call it the data set) in the form of $2^a$ binary words each $b$ digits in length (Figure 12.10); one can think of them as $2^a$ points chosen at random from a space of $2^b$ points. (Take a million $\approx 2^{20}$ words of length 100, for a practical example.) We will suppose that the data set is to be chosen at random from all possible sets so that one cannot expect to find much redundant structure within it. Then the ordered data set requires about $b \cdot 2^a$ bits of binary information for complete description. We won't, however, be interested in the order of the words in the data set. This reduces the amount of information required to store the set to about $(b - a) \cdot 2^a$ bits.

We want a machine that, when given a random $b$-digit word $w$, will answer

Question 1: Is $w$ in the data set?*

and we want to formulate constraints upon how this machine works in such a way that we can separate computational aspects from memory aspects. The following scheme achieves this goal well enough to show, by examples, how little is known about the conjugacy between time and memory.

*We will get to Question 2 in about fifteen minutes.

We will give our machine a memory of $M$ separate bits, that is, one-digit binary words. We are required to compose, in advance, before we see the data set, two algorithms $A_{\text{file}}$ and $A_{\text{find}}$ that satisfy the following conditions:

1. $A_{\text{file}}$ is given the data set. Using this as data, it fills the $M$ bits of memory with information. Neither the data set nor $A_{\text{file}}$ are used again, nor is $A_{\text{find}}$ allowed to get any information about what $A_{\text{file}}$ did, except by inspecting the contents of $M$.

2. $A_{\text{find}}$ is then given a random word, $w$, and asked to answer Question 1, using the information stored in the memory by $A_{\text{file}}$. We are interested in how many bits $A_{\text{find}}$ has to consult in the process.

3. The goal is to optimize the design of $A_{\text{file}}$ and $A_{\text{find}}$ to minimize the number of memory references in the question-answering computation, averaged over all possible words $w$.

12.6.1 Case 1: Enormous Memory

It is plausible that the larger $M$ is, the smaller will be the average number of memory-references $A_{\text{find}}$ must make. Suppose that $M \ge 2^b$. Let $m_j$ be the $j$th bit in memory; then there is a bit $m_w$ for each possible query word $w$, and we can define

$A_{\text{file}}$: set $m_w$ to 1 if $w$ is in the data set.
$A_{\text{find}}$: $w$ is in the data set if $m_w = 1$.
Thus, with a huge enough memory, only one reference is required to answer Question 1.

12.6.2 Case 2: Inadequate Memory

Suppose that $M < (b - a)2^a$. Here, the problem cannot be solved at all, since $A_{\text{file}}$ cannot store enough information to describe the data set in sufficient detail.

12.6.3 Case 3: Binary Logarithmic Sort

Suppose that $M = b \cdot 2^a$. Now there is enough room to store the ordered data set. Define

$A_{\text{file}}$: store the words of the data set in ascending numerical order.
$A_{\text{find}}$: perform a binary search to see first which half of memory might contain $w$, then which quartile, etc.

This will require at most $a = \log_2 2^a$ inspections of $b$-bit words, that is, $a \cdot b$ bit-inspections. This is not an optimal search since (1) one does not always need to inspect a whole word to decide which word to inspect next, and (2) it does not exploit the uniformity of distribution that the first $a$ digits of the ordered data set will (on the average) show. Effect 1 reduces the required number from $a \cdot b$ to the order of $\tfrac{1}{2} a \cdot b$ and effect 2 reduces it from $a \cdot b$ to $a \cdot (b - a)$. We don't know exactly how these two effects combine.

12.6.4 Case 4: Exhaustive Search

Consider $M = (b - a)2^a$. This gives just about enough memory to represent the unordered data set. For example we could define

$A_{\text{file}}$: First put the words of the data set in numerical order. Then compute their successive differences. These will require about $(b - a)$ bits each. Use a standard method of information theory, Huffman Coding (say), to represent this sequence; it will require about $(b - a)2^a$ bits.

But the only retrieval schemes we can think of are like

$A_{\text{find}}$: Add up successive differences in memory until the sum equals or exceeds $w$. If equality occurs, then $w$ is in the data set.

And this requires $\tfrac{1}{2}(b - a)2^a$ memory references, on the average. It seems clear that, given this limit on memory, no $A_{\text{file}}$-$A_{\text{find}}$ pair can do much better.
That is, we suspect that

If no extra memory is available then to answer Question 1 one must, on the average, search through half the memory.

One might go slightly further: even Huffman Coding needs some extra memory, and if there is none, $A_{\text{file}}$ can only store an efficient “number” for the whole data set. Then the conjecture is that $A_{\text{find}}$ must almost always look at almost all of memory.

12.6.5 Case 5: Hash Coding

Consider $M = b \cdot 2^a \cdot 2$. Here we have a case in which there is a substantial margin of extra memory, about twice what is necessary for the data set. The result here is really remarkable (one might even say counterintuitive) because the mean number of references becomes very small. The procedure uses a concept well known to programmers who use it in “symbolic assembly programs” for symbol-table references, but does not seem to be widely known to other computer “specialists.” It is called hash coding.

There are many variants of this idea. We discuss a particular form adapted to a redundancy of two. In the hash-coding procedure, $A_{\text{file}}$ is equipped with a subprogram $R(w, j)$ that, given an integer $j$ and a $b$-bit word $w$, produces an $(a + 1)$-bit word. The function $R(w, j)$ is “pseudorandom” in the sense that for each $j$, $R(w, j)$ maps the set of all $2^b$ input words with uniform density on the $2^{a+1}$ possible output words and, for different $j$'s, these mappings are reasonably independent or orthogonal. One could use symmetric functions, modular arithmetics, or any of the conventional pseudorandom methods.*

Now, we think of the $b \cdot 2^{a+1}$-bit memory as organized into $b$-bit registers with $(a + 1)$-bit addresses: [figure]

Suppose that $A_{\text{file}}$ has already filed the words $w_1, \ldots, w_n$, and it is about to dispose of $w_{n+1}$.

$A_{\text{file}}$: Compute $R(w_{n+1}, 1)$. If the register with this address is empty put $w_{n+1}$ in it. If that register is occupied do the same with $R(w_{n+1}, 2)$, $R(w_{n+1}, 3)$, . . .
until an unoccupied register $R(w_{n+1}, j)$ is found; file $w_{n+1}$ therein.

$A_{\text{find}}$: Compute $R(w, 1)$. If this register contains $w$, then $w$ is in the data set. If $R(w, 1)$ is empty, then $w$ is not in the data set. If $R(w, 1)$ contains some other word not $w$, then do the same with $R(w, 2)$, and if necessary $R(w, 3)$, $R(w, 4)$, . . . , until either $w$ or an empty register is discovered.

*There is a superstition that $R(w, j)$ requires some magical property that can only be approximated. It is true that any particular $R$ will be bad on particular data sets, but there is no problem at all when we consider average behavior on all possible data sets.

On the average, $A_{\text{file}}$ will make less than $2b$ memory-bit references! To see this, we note first that, on the average, this procedure leads to the inspection of just 2 registers! For half of the registers are empty, and the successive values of $R(w, j)$ for $j = 1, 2, \ldots$ are independent (with respect to the ensemble of all data sets) so the mean number of registers inspected to find an empty register is 2. Actually, the mean termination time is slightly less, because for $w$'s in the data set the expected number of inspected registers is $< 2$. The procedure is useful for symbol-tables and the like, where one may want not only to know if $w$ is there, but also to retrieve some other data associated (perhaps again by hash coding) with it.

When the margin of redundancy is narrowed, for example, if

$$M = \frac{n}{n - 1} \cdot b \cdot 2^a,$$

then only $(1/n)$th of the cells will be empty and one can expect to have to inspect about $n$ registers.

Because people are accustomed to the fact that most computers are “word-oriented” and normally do inspect $b$ bits at each memory cycle, the following analysis has not (to our knowledge) been carried through to the case of 1-bit words.
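Before the bit-level analysis, the register-level filing and retrieval procedure described above can be sketched in Python. This is an illustration under stated assumptions, not the authors' program: the sizes `A` and `B`, the seed, and the use of a BLAKE2b digest as the pseudorandom function $R(w, j)$ are all choices made here for the example.

```python
import hashlib
import random

A, B = 10, 40          # hypothetical sizes: 2**A data words, each B bits long
EMPTY = None           # marker for an unoccupied register

def R(w, j):
    """Pseudorandom (A+1)-bit address for word w on probe j (illustrative choice)."""
    digest = hashlib.blake2b(f"{w}:{j}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % (1 << (A + 1))

def file_words(words):
    """A_file: put each word in the first empty register among R(w,1), R(w,2), ..."""
    memory = [EMPTY] * (1 << (A + 1))
    for w in words:
        j = 1
        while memory[R(w, j)] is not EMPTY:
            j += 1
        memory[R(w, j)] = w
    return memory

def find(memory, w):
    """A_find: return (found, number_of_registers_inspected)."""
    j = 1
    while True:
        content = memory[R(w, j)]
        if content == w:
            return True, j
        if content is EMPTY:
            return False, j
        j += 1

rng = random.Random(0)
DATA = rng.sample(range(1 << B), 1 << A)      # distinct random B-bit words
MEMORY = file_words(DATA)                     # memory ends up exactly half full
DATA_SET = set(DATA)
QUERIES = [q for q in (rng.getrandbits(B) for _ in range(2000))
           if q not in DATA_SET]
MEAN_PROBES = sum(find(MEMORY, q)[1] for q in QUERIES) / len(QUERIES)
```

Every filed word is found again, and for words not in the data set `MEAN_PROBES` comes out close to 2, in line with the argument that half of the registers are empty.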
When we program $A_{\text{find}}$ to match words bit by bit we find that, since half the words in memory are zero, matching can be speeded up by assigning a special “zero” bit to each word. Assume, for the moment, that we have room for these $2^a$ extra bits. Now suppose that a certain $w_0$ is not in the data set. (This has probability $1 - 2^{a-b}$.) First inspect the “zero” bit associated with $R(w_0, 1)$. This has probability $\tfrac{1}{2}$ of being zero. If it is not zero, then we match the bits of $w_0$ with those of the word associated with $R(w_0, 1)$. These cannot all match (since $w_0$ isn't in the data set) and in fact the mismatch will be found in (an average of) $2 = 1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots$ references. Then the “zero” bit of $R(w_0, 2)$ must be inspected, and the process repeats. Each repeat has probability $\tfrac{1}{2}$ and the process terminates when the “zero” bit of some $R(w_0, j) = 0$. The expected number of references can be counted then as

$$\tfrac{1}{2}\bigl(1 + 2 + \tfrac{1}{2}\bigl(1 + 2 + \tfrac{1}{2}(\ldots)\bigr)\bigr) + 1 = 3 + 1 = 4.$$

If $w_0$ is in the data set (an event whose probability is $2^{a-b}$) and we repeat the analysis we get $4 + b$ references, because the process must terminate by matching all $b$ bits of $w_0$. The expected number of bit-references, overall, is then

$$4(1 - 2^{a-b}) + (4 + b)\,2^{a-b} = 4 + b \cdot 2^{a-b} \approx 4$$

since normally $2^{a-b}$ will be quite tiny. We consider it quite remarkable that so little redundancy (a factor of two) yields this small number!

The estimates above are on the high side because in the case that $w_0$ is in the data set the “run length” through $R(w_0, j)$ will be shorter, by nearly a factor of 2, than chance would have it just because they were put there by $A_{\text{file}}$. On the other hand, we must pay for the extra “zero” bits we adjoined to $M$. If we have $M = 2b \cdot 2^a$ bits and make words of length $b + 1$ instead of $b$, then the memory becomes slightly more than half full: in fact, we must replace “4” in our result by something like $4[(b + 1)/(b - 1)]$.
Perhaps these two effects will offset one another; we haven't made exact calculations, mainly because we are not sure that even this $A_{\text{file}}$-$A_{\text{find}}$ pair is optimal. It certainly seems suspicious that half the words in memory are simply empty! On the other hand, the best one could expect from further improving the algorithms is to replace 4 by 3 (or 2?), and this is not a large enough carrot to work hard for.

12.6.6 Summary of Exact Matching Algorithms

To summarize our results on Question 1 we have established upper bounds for the following cases. We believe that they are close to lower bounds also but, especially in cases 3 and 4, are not sure.

| Case | Memory size | Bit-references | Method |
|------|-------------|----------------|--------|
| 2 | $< (b - a)2^a$ | $\infty$ | (impossible) |
| 4 | $(b - a)2^a$ | $\tfrac{1}{2}(b - a)2^a$ | (search all memory) |
| 3 | $b \cdot 2^a$ | $\tfrac{1}{2} b \cdot a$ | (logarithmic sort) |
| 5 | $2b \cdot 2^a$ | $4 + \epsilon$ | (hash coding) |
| 1 | $\ge 2^b$ | $1$ | (table look-up) |

12.7 Time vs. Memory for Best Matching: An Open Problem

We have summarized our (limited) understanding of “Question 1” (the exact matching problem) by the little table in §12.6.6. If one “plots the curve” one is instantly struck by the effectiveness of small degrees of redundancy. We do not believe that this should be taken too seriously, for we suspect that when the problem is slightly changed the result may be quite different. We consider now

Question 2: Given $w$, exhibit the word $\hat{w}$ closest to $w$ in the data set.

The ground rules about $A_{\text{file}}$ and $A_{\text{find}}$ are the same, and distance can be chosen to be the usual metric, that is, the number of digits in which two words disagree. If $x_1, \ldots, x_b$ and $\hat{x}_1, \ldots, \hat{x}_b$ are the (binary) coordinates of points $w$ and $\hat{w}$ then we define the Hamming distance to be

$$d(w, \hat{w}) = \sum_{i=1}^{b} |x_i - \hat{x}_i|.$$

One gets exactly the same situation with the usual Cartesian distance $C(w, \hat{w})$, because

$$[C(w, \hat{w})]^2 = \sum |x_i - \hat{x}_i|^2 = \sum |x_i - \hat{x}_i| = d(w, \hat{w})$$

so both $C(w, \hat{w})$ and $d(w, \hat{w})$ are minimized by the same $\hat{w}$.

12.7.1 Case 1: $M = 2^b \cdot b$.
$A_{\text{file}}$ assigns for every possible word $w$ a block of $b$ bits that contain the appropriate bits of the correct $\hat{w}$. $A_{\text{find}}$ looks in the block for $w$ and writes out $\hat{w}$. It uses $b$ references, which seems about the smallest possible number.

12.7.2 Case 2: $M < (b - a)2^a$.

Impossible, for the same reason as in Question 1.

12.7.3 Case 3: $M = b \cdot 2^a$

No result known.

12.7.4 Case 4: $M = (b - a)2^a$

This presumably requires $(b - a) \cdot 2^a$ references, that is, all of memory, for the same reason as in Question 1.

12.7.5 Case 5: $(b - a)2^a < M < b \cdot 2^b$

No useful results known.

12.7.6 Gloomy Prospects for Best Matching Algorithms

The results in §12.6.6 showed that relatively small factors of redundancy in memory size yield very large increases in speed, for serial computations requiring the discovery of exact matches. Thus, there is no great advantage in using parallel computational mechanisms. In fact, as shown in §12.6.5, a memory-size factor of just 2 is enough to reduce the mean retrieval time to only slightly more than the best possible.

But, when we turn to the best match problem, all this seems to evaporate. In fact, we conjecture that even for the best possible $A_{\text{file}}$-$A_{\text{find}}$ pairs, the speed-up value of large memory redundancies is very small, and for large data sets with long word lengths there are no practical alternatives to large searches that inspect large parts of the memory.

We apologize for not having a more precise statement of the conjecture, or good suggestions for how to prove it, for we feel that this is a fundamentally important point in the theory of computation, especially in clarifying the distinction between serial and parallel concepts.

Our belief in this conjecture is based in part on experience in finding fallacies in schemes proposed for constructing fast best-matching file and retrieval algorithms. To illustrate this we discuss next the proposal most commonly advanced by students.
12.7.7 The Numerical-Order Scheme

This proposal is a natural attempt to extend the method of Case 3 (§12.6.3) from exact match to best match. The scheme is

$A_{\text{file}}$: store the words of the data set in numerical order.
$A_{\text{find}}$: given a word $w$, find (by some procedure) those words whose first $a$ bits agree most closely with the first $a$ bits of $w$. How to do this isn't clear, but it occurs to one that (since this is the same problem on a smaller scale!) the procedure could be recursively defined. Then see how well the other bits of these words match with $w$. Next, ...(?)

The intuitive idea is simple: the word $\hat{w}$ in the data set that is closest to $w$ ought to show better-than-chance agreement in the first $a$ bits, so why not look first at words known to have this property. There are two disastrous bugs in this program:

1. When can one stop searching? What should we fill in where we wrote “Next, ...(?)”? We know no nontrivial rule that guarantees getting the best match.

2. The intuitive concept, reasonable as it may appear, is not valid! It isn't even of much use for finding a good match, let alone finding the best match.

To elaborate on point 2, consider an example: let $a = 20$, $b = 10{,}000$. Let $w$, for simplicity, be the all-zero word. A typical word in the data set will have a mean of 5000 ones, and 5000 zeros. The standard deviation will be $\tfrac{1}{2}(10{,}000)^{1/2} = 50$. Thus, less than one word in $2^a = 2^{20}$ can be expected to have fewer than 4750 ones. Hence, the closest word in the data set will (on the average) have at least this many ones. That closest word will have (on the average) $\ge 20 \cdot (4750/10{,}000) = 9.5$ ones among its first 20 bits! The probability that $\hat{w}$ will indeed have very few ones in its first 20 bits is therefore extremely low, and the slight favorable bias obtained by inspecting those words first is quite utterly negligible in reducing the amount of inspection. Besides, objection 1 still remains.
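The arithmetic of this example can be checked in a few lines. The sketch below uses a normal approximation to the binomial tail (an assumption made here; the text states only the conclusion), and the variable names are illustrative.

```python
import math

A, B = 20, 10_000                       # the values of a and b used in the example

mean_ones = B / 2
std_ones = math.sqrt(B) / 2             # binomial: sqrt(B * 1/2 * 1/2)

def normal_lower_tail(z):
    """P(Z < z) for a standard normal variable Z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

z = (4750 - mean_ones) / std_ones       # 4750 ones is five standard deviations low
tail = normal_lower_tail(z)             # approximate chance of fewer than 4750 ones
expected_words_below = tail * 2 ** A    # expected number of such words among 2**A
ones_in_first_a_bits = A * 4750 / B     # average ones in the first A bits
```

Under this approximation the tail probability is smaller than $2^{-20}$, so fewer than one of the $2^{20}$ words is expected to have under 4750 ones, and such a word still averages 9.5 ones among its first 20 bits.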
Ordering the words by their first few bits is quite useless, then. Classifying words in this way amounts, in the $n$-dimensional geometry, to breaking up the space into “cylinders” which are not well shaped for finding nearby points. We have, therefore, tried various arrangements of spheres, but the same sorts of trouble appear (after more analysis). In the course of that analysis, we are led to suspect that there is a fundamental property of $n$-dimensional geometry that puts a very strong and discouraging limitation upon all such algorithms.

12.7.8 Why is Best Match so Different from Exact Match?

If our unproved conjecture is true, one might want at least an intuitive explanation for the difference we would get between §12.6.3 and §12.7.3. One way to look at it is to emphasize that, though the phrases “best match” and “exact match” sound similar to the ear, they really are very different. For in the case of exact match, no error is allowed, and this has the remarkable effect of changing an $n$-dimensional problem into a one-dimensional problem! For best matching we used the formula

$$\text{Error} = \sum_{i=1}^{b} |x_i - \hat{x}_i| = \sum_{i=1}^{b} 1 \cdot |x_i - \hat{x}_i|,$$

where we have inserted the coefficient “1” to show that all errors, in different dimensions, are counted equally. But for exact match, since no error can be tolerated, we don't have to weight them equally: any positive weights will do! So for exact match we could just as well write

$$\text{Error} = \sum_{i=1}^{b} 2^i\, |x_i - \hat{x}_i| \qquad \text{or even} \qquad \text{Error} = \sum_{i=1}^{b} 2^i\, (x_i - \hat{x}_i)$$

because either of these can be zero only when all $x_i = \hat{x}_i$. (Shades of stratification.) But then we can (finally) rewrite the latter as

$$\text{Error} = \Bigl(\sum 2^i x_i\Bigr) - \Bigl(\sum 2^i \hat{x}_i\Bigr)$$

and we have mapped the $n$-dimensional vector $(x_1, \ldots, x_b)$ into a single point on a one-dimensional line. Thus these superficially similar problems have totally different mathematical personalities!
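A small Python sketch makes the contrast concrete. It maps a bit vector $(x_1, \ldots, x_b)$ to the integer $\sum 2^i x_i$, confirms on all 4-bit words that the integers coincide exactly when the Hamming distance is zero, and then shows with hand-picked 8-bit words (hypothetical examples chosen here) that closeness on the line says little about Hamming distance.

```python
from itertools import product

def as_integer(bits):
    """Map (x_1, ..., x_b) to sum_i 2**i * x_i, a single point on a line."""
    return sum(x << i for i, x in enumerate(bits, start=1))

def hamming(u, v):
    """Number of coordinates in which two equal-length bit tuples disagree."""
    return sum(a != b for a, b in zip(u, v))

# Exact match: zero Hamming distance if and only if the integers are equal.
EXACT_MATCH_AGREES = all(
    (hamming(u, v) == 0) == (as_integer(u) == as_integer(v))
    for u in product([0, 1], repeat=4)
    for v in product([0, 1], repeat=4)
)

# Best match: distance on the line does not track Hamming distance.
W = (0, 0, 0, 0, 0, 0, 0, 1)            # integer 2**8 = 256
CLOSE_ON_LINE = (1, 1, 1, 1, 1, 1, 1, 0)  # integer 254, but all 8 bits differ from W
CLOSE_IN_HAMMING = (0, 0, 0, 0, 0, 0, 0, 0)  # integer 0, but only 1 bit differs from W
```

So a sorted list of integers settles the exact-match question, while the word nearest to `W` on the line (`CLOSE_ON_LINE`) is as far from it as possible in Hamming distance.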
12.8 Incremental Computation

All the $A_{\text{file}}$ algorithms mentioned have the following curiously local property. They can be described roughly as computing the stored information $M$ as a function of a large data set:

$$M = A_{\text{file}}(\text{data set}).$$

Now one can imagine algorithms which would use a vast amount of temporary storage (that is, much more than $M$ or much more than is needed to store the data set) in order to compute $M$. Our $A_{\text{file}}$ algorithms do not. On the contrary, they do not even use significantly more memory capacity than is needed to hold their final output, $M$. They are even content to examine just one member of the data set at a time, with no control over which they will see next, and without any subterfuge of storing the data internally.

It seems to us that this is an interesting property of computation that deserves to be studied in its own right. It is striking how many apparently “global” properties of a data set can be computed “incrementally” in this sense. Rather than give formal definitions of these terms, we shall illustrate them by simple examples.

Suppose one wishes to compute the median of a set of a million distinct numbers which will be presented in a long, disordered list. The standard solution would be to build up in memory a copy of the entire set in numerical order. The median number can then be read off. This is not an incremental computation because the temporary memory capacity required is a million times as great as that required to store the final answer. Moreover it is easy to see that there is no incremental procedure if the data is presented only once.

The situation changes if the list is repeated as often as we wish. For then two registers are enough to find the smallest number on one pass through the list, the second smallest on a second pass, and so on. With an additional register big enough to count up to half the number $N$ of items in the list, we can eventually find the median.
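The repeated-pass procedure can be sketched as follows. This is one possible realization, with hypothetical names: `make_pass` stands for "present the list again," and for an even number of items the sketch returns the lower of the two middle values, a convention chosen here.

```python
def median_by_passes(make_pass, n_items):
    """Median of n_items distinct numbers using two registers and a counter.

    make_pass() must return a fresh iterator over the same list each time.
    """
    target_rank = (n_items + 1) // 2     # counter register: how many minima to extract
    previous = None                      # register 1: value found on the previous pass
    for _ in range(target_rank):
        best = None                      # register 2: smallest value above `previous`
        for x in make_pass():
            if (previous is None or x > previous) and (best is None or x < best):
                best = x
        previous = best
    return previous

# Usage with a small, hypothetical list; each call to the lambda replays the list.
DATA = [7, 3, 9, 1, 5]
MEDIAN = median_by_passes(lambda: iter(DATA), len(DATA))
```

The cost is about $N/2$ passes over the list, which is the "profligate expenditure of time" traded for the tiny amount of storage.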
It might seem at first sight, however, that an incremental computation is precluded if the numbers are presented in a random sequence, for example by being drawn, with replacement, from a well-stirred urn. But a little thought will show that an even more profligate expenditure of time will handle this case incrementally provided we can assume (for example) that we know the number of numbers in the set and are prepared to state in advance an acceptable probability of error.

What functions of big “data sets” allow these drastic exchanges of time for storage space? Readers might find it amusing to consider that to compute the best plane (§12.2.3), given random presentation of samples, and bounds on the coefficients, requires only about three solution-sized memory spaces.

One predicate we think cannot be computed without storage as large as the data set is:

⌈the numbers in the data set, concatenated in numerically increasing order, form a prime number⌉.

In case anyone suspects that all functions are incrementally computable (in some sense) let him consider functions involving decisions about whether concatenations of members of the data set are halting tapes for Turing machines.

13 Perceptrons and Pattern Recognition

13.0 Introduction

Many of the theorems show that perceptrons cannot recognize certain kinds of patterns. Does this mean that it will be hard to build machines to recognize those patterns?

No. All the patterns we have discussed can be handled by quite simple algorithms for general-purpose computers.

Does this mean, then, that the theorems are very narrow, applying only to this small class of linear-separation machines?

Not at all. To draw that conclusion is to miss the whole point of how mathematics helps us understand things! Often, the main value of a theorem is in the discovery of a phenomenon itself rather than in finding the precise conditions under which it occurs.
Everyone knows, for example, about the “Fourier Series phenomenon” in which linear series expansions over a restricted class of functions (the sines and cosines) are used to represent all the functions of a much larger class. But only a very few of us can recall the precise conditions of a theorem about this! The important knowledge we retain is heuristic rather than formal: that a fruitful approach to certain kinds of problems is to seek an appropriate base for series expansions.

That sounds very sensible. But how might it apply to the theorems about perceptrons?

For example, the stratification theorem shows that certain predicates have lower order than one’s geometric intuition would suggest; one can encode information in a “nonstandard” way by using very large coefficients. The conditions given for Theorem 7.2 are somewhat arbitrary, and many predicates can be realized in similar ways without meeting exactly these conditions. The theorem itself is just a vehicle for thoroughly understanding an instance of the more general encoding phenomenon.

Does it apply also to the negative results?

Yes, although it is harder to tell when “phenomena of limitations” will extend to more general machine-schemas. After we circulated early drafts of this book, we heard that some perceptron advocates made statements like “Their conclusions hold only if their conditions are exactly satisfied; our machines are not exactly the same as theirs.” But consider, for example, the kind of limitation demonstrated by the And/Or theorem. Although the limitation as stated could be circumvented by adding another layer of logic to the machine scheme to permit “and”-ing two perceptrons together, this would certainly miss the point of the phenomenon. To be sure, the new machine will realize some predicates that the simpler machines could not.
But if the and/or phenomenon is understood, then the student will quickly ask: Is the new machine itself subject to a similar closure limitation? We expect that no moderate extension of the machine-schema in such a direction would really make much difference to its ability to handle context-dependence.

We believe (but cannot prove) that the deeper limitations extend also to the variant of the perceptron proposed by A. Gamba. We discuss this in the next section.

13.1 Gamba Perceptrons and other Multilayer Linear Machines

In a series of papers (1960, 1961), A. Gamba and his associates describe experiments with a type of perceptron in which each $\varphi$ is itself a thresholded measure, that is, a perceptron of order 1:

$$\varphi_i = \Big\lceil \sum_j \beta_{ij} x_j > \theta_i \Big\rceil, \qquad \psi = \Big\lceil \sum_i \alpha_i \Big\lceil \sum_j \beta_{ij} x_j > \theta_i \Big\rceil > \theta \Big\rceil.$$

This scheme lends itself to physically realizable linear devices. For example, each $\varphi$ could be realized by an optical filter and threshold photodetector (Figure 13.1).

Filters have been proposed that range from completely random patterns to carefully selected “feature detectors,” moment integrals, and templates. One can even obtain complex values for the $\beta_{ij}$’s by using paired masks, or phase-coherent optics. We would like to have a good theory of these machines, especially because very large arrays can be obtained so economically by optical and similar techniques.

Unfortunately, we do not know how to adapt the algebraic methods that worked on order-limited perceptrons, nor have we found any other analytic techniques. We can thus make only a few observations and ask questions. Note first that if the inner threshold operations are eliminated, we have simply an order-1 perceptron:

$$\psi = \Big\lceil \sum_i \alpha_i \sum_j \beta_{ij} x_j > \theta \Big\rceil = \Big\lceil \sum_j \alpha'_j x_j > \theta \Big\rceil.$$

That the nonlinear operations have a real effect is shown by the fact that a simple Gamba machine can recognize the predicate

$$\psi_{ABC}(X) = \big\lceil (|X \cap A| > |X \cap B|) \lor (|X \cap C| > |X \cap B|) \big\rceil$$

that was shown in Chapter 4 to be not of finite order.
For if we define

$$\beta_{1j} = \begin{cases} 1 & \text{if } x_j \in A \\ -1 & \text{if } x_j \in B \\ 0 & \text{otherwise} \end{cases} \qquad \beta_{2j} = \begin{cases} 1 & \text{if } x_j \in C \\ -1 & \text{if } x_j \in B \\ 0 & \text{otherwise} \end{cases}$$

then it follows that

$$\psi_{ABC} = \lceil \varphi_1 + \varphi_2 > 0 \rceil$$

is recognizable by a Gamba machine. Another predicate of unbounded order, $\psi_{\text{PARITY}}$, can be realized by simple Gamba machines: in fact any predicate $\psi(X) = \psi(|X|)$ that depends only on $|X|$ can be realized as follows: define

$$\varphi_{(n)}(X) = \lceil |X| \ge n \rceil = \Big\lceil \sum x_i \ge n \Big\rceil$$

and define

$$\alpha_0 = \psi(0), \quad \alpha_1 = \psi(1) - \alpha_0, \quad \ldots, \quad \alpha_{n+1} = \psi(n+1) - \sum_{i=0}^{n} \alpha_i.$$

Then we can write

$$\psi(X) = \Big\lceil \sum_{i=0}^{|R|} \alpha_i \varphi_{(i)} > 0 \Big\rceil.$$

If we place no limitation on the number of Gamba-masks, then the machines can recognize any pattern, since we can define, for each figure $F$, a template that recognizes exactly that figure:

$$\varphi_F = \Big\lceil \sum_{x_i \in F} x_i - \sum_{x_i \notin F} x_i \ge |F| \Big\rceil.$$

Then any class $\mathbf{F}$ of figures is recognized by

$$\psi_{\mathbf{F}} = \Big\lceil \sum_{F \in \mathbf{F}} \varphi_F > 0 \Big\rceil.$$

This is not an interesting result, for it says only that any class has a disjunctive Boolean form, and can be recognized if one allows as many Gamba masks as there are figures in the class. But it is interesting that the area-dependent classes like $\psi_{\text{PARITY}}$ and $\psi_{ABC}$ require at most $|R|$ masks, as shown above. It is not hard to prove that $\psi_{\text{PARITY}}$ requires at least $\log |R|$ Gamba masks, but it would be nice to have a sharper result. We are quite sure that for predicates involving more sophisticated relations between subparts of figures the Gamba machines, like the order-limited machines, are relatively helpless, but the precise statement of this conjecture eludes us. For instance, we think that $\psi_{\text{CONNECTED}}$ would require an enormous number of Gamba-masks, perhaps nearly as many as the class of connected figures. Such conjectures, stated in terms of the number of $\varphi$’s in the machine, seem harder to deal with than the simpler categorical impossibility of recognition with any number of masks of bounded order.
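The construction for predicates that depend only on |X| can be checked mechanically on a small retina. The sketch below is an illustration under stated assumptions: the function name `gamba_for_count_predicate` is hypothetical, and the inner masks are taken to be the thresholded counts ⌈|X| ≥ n⌉ for n = 0, ..., |R|, which is the reading under which the outer sum telescopes to ψ(|X|).

```python
from itertools import product


def gamba_for_count_predicate(psi, retina_size):
    """Build outer coefficients and a two-layer threshold machine for psi(|X|).

    psi maps a count 0..retina_size to 0 or 1.
    """
    alphas = []
    for n in range(retina_size + 1):
        # alpha_n = psi(n) - (alpha_0 + ... + alpha_{n-1})
        alphas.append(psi(n) - sum(alphas))

    def machine(x):
        total = sum(x)
        # Inner layer: each mask is an order-1 threshold on the count of points.
        inner = [1 if total >= n else 0 for n in range(retina_size + 1)]
        # Outer layer: one more linear threshold over the mask outputs.
        return int(sum(a * phi for a, phi in zip(alphas, inner)) > 0)

    return alphas, machine


def parity(count):
    return count % 2


alphas, machine = gamba_for_count_predicate(parity, 4)
print(alphas)
print(all(machine(x) == parity(sum(x)) for x in product([0, 1], repeat=4)))
```

For a figure with m points, the masks for n = 0, ..., m fire, so the outer sum equals the sum of the first m + 1 coefficients, which is ψ(m) by the way the coefficients are defined.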
13.2 Other Multilayer Machines

Have you considered “perceptrons” with many layers?

Well, we have considered Gamba machines, which could be described as “two layers of perceptron.” We have not found (by thinking or by studying the literature) any other really interesting class of multilayered machine, at least none whose principles seem to have a significant relation to those of the perceptron. To see the force of this qualification it is worth pondering the fact, trivial in itself, that a universal computer could be built entirely out of linear threshold modules. This does not in any sense reduce the theory of computation and programming to the theory of perceptrons. Some philosophers might like to express the relevant general principle by saying that the computer is so much more than the sum of its parts that the computer scientist can afford to ignore the nature of the components and consider only their connectivity. More concretely, we would call the student’s attention to the following considerations:

1. Multilayer machines with loops clearly open all the questions of the general theory of automata.

2. A system with no loops but with an order restriction at each layer can compute only predicates of finite order.

3. On the other hand, if there is no restriction except for the absence of loops, the monster of vacuous generality once more raises its head.

The problem of extension is not merely technical. It is also strategic. The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations. It has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version.
Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile. Perhaps some powerful convergence theorem will be discovered, or some profound reason for the failure to produce an interesting “learning theorem” for the multilayered machine will be found.

13.3 Analyzing Real-World Scenes

One can understand why you, as mathematicians, would be interested in such clear and simple predicates as $\psi_{\text{PARITY}}$ and $\psi_{\text{CONNECTED}}$. But what if one wants to build machines to recognize chairs and tables or people? Do your abstract predicates have any relevance to such problems, and does the theory of the simple perceptron have any relevance to the more complex machines one would use in practice?

This is a little like asking whether the theory of linear circuits has relevance to the design of television sets. Absolutely, some concept of connectedness is required for analyzing a scene with many objects in it. For the whole is just the sum of its parts and their interrelations, and one needs some analysis related to connectedness to divide the scene into the kinds of parts that correspond to physically continuous objects.

Then must we conclude from the negative results of Chapter 5 that it will be very hard to make machines to analyze scenes?

Only if you confine yourself to perceptrons. The results of Chapter 9 show that connectivity is not particularly hard for more serial machines.

But even granting machines that handle connectivity, isn’t there an enormous gap from that to machines capable of finding the objects in a picture of a three-dimensional scene?

The gap is not so large as one might think. To explain this, we will describe some details of a kind of computer program that can do it. The methods we will describe fall into the area that is today called “heuristic programming” or “artificial intelligence.”*

*See, for example,
Computers and Thought (Feigenbaum and Feldman, 1963) and Semantic Information Processing (Minsky, 1968).

Consider the problem of designing a machine that, when presented with a photograph, will be capable of describing the following scene.

[figure]

One would want the machine to say, at least, that this scene shows four objects, three of them rectangular and the fourth cylindrical, and to say something about their relative positions.

The tradition of heuristic programming would suggest providing the machine with a number of distinct abilities such as the following:

1. The ability to detect points at which the light conditions change rapidly enough to suggest an edge.

2. The ability to “cluster” the set of proposed edge-points into subsets that may each be taken as a hypothetical line segment or curve.

3. The ability to pass from the “line drawing” to a list of connected regions or “faces.”

4. The ability to cluster faces into objects. In §13.4, we will describe such a method, developed by one of our students, Adolfo Guzman. This procedure is remarkably insensitive to partial covering of one object by another.

5. The ability to recognize certain features, such as shadows, as artifacts.

Perhaps most important is

6. The ability to make each of the above decisions on a tentative “working” basis and retract them if something “implausible” happens in any phase of the procedure. For example, if a region in Step 3 turns out to have an unusually complicated shape (relative to the class of objects for which it is designed) the existence of some of the lines proposed in Step 2 might be challenged, or others might be proposed, to be verified by re-activating Step 1 with a lower threshold.

7. All these processes might be organized by a supervisory program like the “General Problem Solver” of Newell, Shaw, and Simon (1959) or the executive program of a large programming system.

A system with such a set of abilities is in a very different class from a perceptron, if only because of the variety of operations it performs and forms of knowledge it uses. People often suggest that the methods of artificial intelligence and those associated with the perceptron are not as opposed as we think. For example, they say a perceptronlike algorithm might be used at each “level” to make the separate kinds of distinction. But using a perceptron as a component of a highly structured system entirely degrades its pretension to be a “self-organizing” system. If one is going to design such a system, one might as well be pragmatic in choosing an appropriate algorithm at each stage. The spirit of the approach we have in mind is illustrated by the role in the following example of sequential operations, hypotheses, and hierarchical descriptions.

13.4 Guzman’s Approach to Scene-Analysis

In scenes like this

[figure]

where all the objects are rectangular solids, and do not occlude one another badly, we can discover the objects by the extremely local process of locating all the “Y-joints.” Each object contains at most one such distinctive feature. This could, of course, fail because of perspective, as in [figure], which could be a cube, or in [figure] (for we require each of the three angles of a Y-joint to span less than 180 degrees). A more serious failure is in the case of occlusion, as in [figure], where one of the Y-joints is completely hidden from view. But the great power of programs capable of hierarchical decisions is illustrated by the possibility of first recognizing the small cube, then removing it, next extending the hidden lines, and so discovering the large cube!
The program developed by Adolfo Guzman proceeds in a rather different way; his idea is to treat different kinds of local configurations as providing different degrees of evidence for “linking” the faces that meet there. For example, in these three types of vertex configurations

[figure: “Y,” “arrow,” and “T” vertex configurations, with the faces meeting at each vertex labeled I, II, III]

the “Y” provides evidence for linking I to II, II to III, and I to III. The “arrow” just links I to II. Because a “T” is ordinarily the result of one object occluding part of another, it is regarded as evidence against linking I to III or II to III (and it is neutral about I and II). Using just these rules, we can convert pictures into associated groups of faces as follows: we represent Y links by straight lines and arrow links by curves.

[figures]

So far, there has been no difficulty in associating linked faces with objects. The usefulness of the variety of kinds of evidence shows up only in more complicated cases:

[figures]

In this figure we find some “false” links due to the merging of vertices on different objects. To break such false connections, the program uses a hierarchical scheme that first finds subsets of faces that are very tightly linked (e.g., by two or more links). These “nuclei” then compete for more weakly linked faces. There is no competition in Examples 1-4, but in 5 the single false links between the cubes are broken by his procedure. In Example 6 the “false” links are broken also. If a very simple “competition” algorithm were not adequate here, one could also take into account the negative evidence the two T-joints provide against linking I-III and II-III.

We have described only the skeleton of his scheme; Guzman uses several other kinds of links, including evidence from collinear T-joint lines of the form [figure], and the effects of some vertices are modified by their associations with others.
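The skeleton of the linking scheme described above can be sketched in a few lines. This is a deliberately simplified illustration and not Guzman's program: the function names, the vertex encoding (a kind plus the faces meeting there, with the first two faces of an “arrow” assumed to be the linked pair), and the merging rule (groups are joined only when two or more links run between them) are all assumptions made for the example, and the negative evidence from T-joints is ignored.

```python
from collections import Counter
from itertools import combinations


def links_from_vertices(vertices):
    """Count link evidence between pairs of faces from Y and arrow vertices."""
    links = Counter()
    for kind, faces in vertices:
        if kind == "Y":
            pairs = list(combinations(faces, 2))   # I-II, I-III, II-III
        elif kind == "arrow":
            pairs = [tuple(faces[:2])]             # only I-II
        else:
            pairs = []                             # "T": no positive evidence
        for a, b in pairs:
            links[frozenset((a, b))] += 1
    return links


def group_faces(faces, links, min_links=2):
    """Merge groups of faces while some pair of groups shares >= min_links links."""
    groups = [{f} for f in faces]
    merged = True
    while merged:
        merged = False
        for g, h in combinations(groups, 2):
            strength = sum(links[frozenset((a, b))] for a in g for b in h)
            if strength >= min_links:
                groups.remove(h)
                g |= h
                merged = True
                break                              # restart with the new grouping
    return sorted(sorted(g) for g in groups)


def cube(prefix):
    """Hypothetical vertex list for one unoccluded cube with three visible faces."""
    f1, f2, f3 = (prefix + str(i) for i in (1, 2, 3))
    return [("Y", (f1, f2, f3)), ("arrow", (f1, f2)),
            ("arrow", (f2, f3)), ("arrow", (f1, f3))]


# Two cubes plus one "false" arrow link where vertices of the two objects merge.
vertices = cube("A") + cube("B") + [("arrow", ("A3", "B1")), ("T", ("A3", "B1", "B2"))]
faces = ["A1", "A2", "A3", "B1", "B2", "B3"]
print(group_faces(faces, links_from_vertices(vertices)))
```

In this toy scene each pair of faces on the same cube is linked twice (once by the Y, once by an arrow), so each cube forms a nucleus, while the single false link between the cubes is too weak to join them.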
This variety of resources enables the program to solve scenes like [figure], in which some object faces are completely disconnected.

The method of combining evidence used by this procedure might suggest a similarity to the perceptron’s weighting of evidence. However, the local character of this similarity most strongly marks the deep difference between the approaches. For Guzman’s algorithm uses this as but a small part of a procedure to evaluate links between abstract entities called faces which in turn have been proposed by another program that uses processes related to connectedness.

In “locally ambiguous” figures, something more akin to reasoning and problem solving is required. For example, the stack of cubes in Figure 13.2 can be falsely structured by the human visual process if one looks only at the center. This structure cannot be extended to the whole figure, suggesting that we, too, use a procedure that in case of failure can “back up” to a different hypothesis.

[Figure 13.2]

It is outside the scope of this book to discuss heuristic programming in further detail. Readers who wish for more information should consult the references.

[Handwritten note by the authors concerning scene analysis; illegible in this scan.]

13.5 Why Prove Theorems?

Why did you prove all these complicated theorems? Couldn’t you just take a perceptron and see if it can recognize $\psi_{\text{CONNECTED}}$?

No.

13.6 Sources and Development of Ideas

Our debts to previous work and to other people, places, and institutions can best be described by a brief historical sketch of our work. Our collaboration began in 1963 when we were brought together by Warren S. McCulloch. We have a special debt to him not only because of this but because he was the first to think seriously about the problems we have studied.

13.6.1 The Group-Invariance Theorem

We had both been interested in the perceptron since its announcement by Rosenblatt in 1957.
In fact we had both presented papers related to its “learning” aspect at a symposium on information theory in London in 1960. Our serious attack on its geometric problems started in the spring of 1965. At that time, it was generally known that order-1 perceptrons could not compute translation-invariant functions other than functions of $|X|$, but there was no hint as to how this might generalize.

In retrospect the most obvious obstacle was the lack of a concept of order. Earlier studies on the power of perceptrons were based on $\Phi$-sets of partial functions defined by stochastic generative processes or subject to irrelevant conditions such as that they themselves be linear threshold functions of a small number of variables. Such limitations (as opposed to our $|S(\varphi)| \le k$) seemed always to produce mathematically intractable situations and so reinforced the dominant tendency to approach the problem as one of statistics rather than algebra. In making the shift, we feel most closely anticipated by Bledsoe and Browning (1959) who considered a pure order limitation on a type of conjunctively local machine.

With the concept of order in mind, the general form of the group-invariance theorem became possible. But we first had to overcome at least four other obstacles of heuristically different kinds.

1. We had to accept the value of studying the geometrically trivial predicate $\psi_{\text{PARITY}}$. No reference to this predicate is logically necessary (or even helpful) in proving the group-invariance theorem, the And/Or theorem, or in explaining the principle of stratification. But we are convinced that its heuristic role was critical. Its very geometric triviality enabled us to see the algebraic principles in all these situations. The same comment applies to the role of the positive normal form: all our results can be proved without it.
But at a time when we were thoroughly confused about everything, it allowed us to replace the bewildering variety of the set of all possible logical functions by the combinatorial tidiness of the masks.

2. The idea of averaging had been in our minds ever since reading Pitts and McCulloch (1947). It was reinforced by a fine proof offered by Tom Marill in response to an exposition, at an M.I.T. seminar, of our early thoughts on the subject. Marill observed that for any $\varphi$, if $|S(\varphi)| < |R|$ then $\{X \mid \varphi(X)\}$ contains the same number of even- and odd-sized $X$’s. It follows immediately that for any set $\Phi$ of such $\varphi$’s, the sets of vectors $\{\Phi(X) \mid |X| \text{ even}\}$ and $\{\Phi(X) \mid |X| \text{ odd}\}$ must have the same center of gravity. They therefore cannot be separated by a hyperplane!

Although highly suggestive, this proof is still marked by the basic weakness of all early thinking about perceptrons, that is, the preoccupation with the image of predicates as sets of points in the $|\Phi|$-dimensional Euclidean space. To obtain the group-invariance theorem, we had to break away from this image. Marill’s proof averages over a set of $|\Phi|$-points; ours averages over a set of functionals defined on the subsets of $|R|$.

3. An “obstacle” of a very different sort was lack of contact with classical mathematical methods of proven depth. The proximity of fundamental properties of polynomials, irreducible curves, Haar integrals, etc., brought the feeling of “real mathematics” as opposed to the “purely combinatorial” methods of earlier papers on perceptrons. This is still sufficiently rare in computer science to be significant. We are convinced that respect for “real mathematics” is a powerful heuristic principle, though it must be tempered with practical judgment.

4. We were reluctant to attach the condition to the group-invariance theorem that $\Phi$ be closed under the group, for this seemed like a strong restriction.
It took us a while to realize that this made the group-invariance theorem stronger rather than weaker! For the theorem is used mainly to show that various predicates cannot be realized by various perceptrons. The closure condition then says that such a predicate cannot be realized even by a perceptron that has all the appropriate $\varphi$’s of each type. Therefore, it cannot be realized by any smaller perceptron, say, one with a random selection from such partial predicates.

13.6.2 $\psi_{\text{CONNECTED}}$

Our first gropings were driven largely by frustration at not being able to prove, even heuristically, the “obvious” fact that this predicate is not of finite order. The diversion to $\psi_{\text{PARITY}}$ was partly motivated by wanting a simpler case for study, partly by the hope of finding a reduction through a switching argument similar to that used to prove the diameter-limited theorem. Our first theorem about the order of $\psi_{\text{CONNECTED}}$ came by a different route, via the one-in-a-box theorem. Although this solved our original problem (in a logical sense) we continued exploring switching arguments, following an intuition which later paid dividends in ways we cannot pretend to have explicitly anticipated. While we were developing the rather complex switching circuits explained in Chapter 5 we entirely missed the simpler argument suggested to us by David Huffman (§5.5). Although Huffman’s construction gives only a weak lower bound to the growth rate of the order of $\psi_{\text{CONNECTED}}$, it provides sufficient proof that the predicate is not of finite order. Moreover, it shows how any predicate on a retina of $|R|$ points can be reduced to computing $\psi_{\text{CONNECTED}}$ on a retina of about $2^{|R|}$ points. Thus this predicate is shown formally to have a kind of universality for these parallel machines akin to that possessed by tree-search in the usual serial machine (and also to suffer from similar exponential disasters).
13.6.3 More Topology

A by-product of work on connectedness was the pleasing (and perhaps puzzling) positive result about the Euler predicate. In an early draft of the book we appended to this a mildly false proof that no other topological invariants could be in £(^ 0 ’ <P m ). When we discovered the correct proof of the theorem (§8.4) that there were no other diameter-limited topological predicates we firmly conjectured that quite different proof methods would be necessary to prove this for the order-limited case. So we were astonished when Mike Paterson, a young British computation theorist who had offered to criticize the manuscript, showed how the ideas in §5.7 could be used to reduce this to the parity switching net, and thus to prove the theorem of §5.9.

13.6.4 Stratification

This is the area in which our early intuitions proved most disastrously wrong.

Our first formal presentation of the principal results in this book was at an American Mathematical Society symposium on Mathematical Aspects of Computer Science in April 1966. At this time we could prove that $\psi_{\text{CONNECTED}}$ was not of finite order and conjectured that the same must be true of the apparently “global” predicates of symmetry and twins described in §7.3.

For the rest of 1966 the matter rested there. We were pleased and encouraged by the enthusiastic reception by many colleagues at the A.M.S. meeting and no less so by the doleful reception of a similar presentation at a Bionics meeting. However, we were now involved in establishing at M.I.T. an artificial intelligence laboratory largely devoted to real “seeing machines,” and gave no attention to perceptrons until we were jolted by attending an I.E.E.E. Workshop on Pattern Recognition in Puerto Rico early in 1967.

Appalled at the persistent influence of perceptrons (and similar ways of thinking) on practical pattern recognition, we determined to set out our work as a book.
Slightly ironically, the first results obtained in our new phase of interest were the pseudo-positive applications of stratification.

Our first contact with the phenomenon was when our student John White showed the predicate $\psi_{\text{HOLLOW SQUARE}}$ to have order three. We had believed it to be order four for reasons the reader will see if he tries to realize it with coefficients bounded independently of the size of the square. Perhaps we were so convinced of the extreme parallelism of the perceptron that we resisted seeing how certain limited forms of serial computation could be encoded into the perceptron algorithm by constructing a size-domination hierarchy.

Whatever the reason, it took us many months to isolate the stratification principle and so understand why we had failed to prove the group-invariance theorem for infinite retinas. It is clear that stratification is not a mere “trick” that can reduce the order of a predicate, but that unbounded coefficients admit an essentially wider range of sequential (conditional) computations, although at such a price that this is of only mathematical interest. We are convinced that most of the predicates in Chapter 7 have, under the bounded condition, no finite orders, and stratification makes the difference between finite and infinite. We have not actually proved this, however.

13.6.5 Learning and Memory

The spirit of the theory in Chapters 11 and 12 is very different from that of our geometric theory. To begin with, the research objectives seem to face in an opposite direction: learning theorems have statements like “If a given predicate is in an $L(\Phi)$ then a certain procedure will find a set of coefficients to represent it.” Whereas, in the main part of our work the main questions were directed towards understanding when and why certain predicates are in certain $L(\Phi)$’s.
Also the proper context for the learning theory seems indeed to be the $|\Phi|$-dimensional coefficient-space in which figures and predicates become points and hyperplanes (or dually). We have emphasized several times that progress towards the geometric theory seemed to come only when we could break away from this representation. However, we decided to discuss the convergence theorem at first mainly because we felt dissatisfied with the uncritical form of all previous presentations. In particular, it was customary to ignore such questions as

Is the perceptron an efficient form of memory?

Does the learning time become too long to be practical even when separation is possible in principle?

How do the convergence results compare to those obtained by more deliberate design methods?

What is the perceptron’s relation to other computational devices?

As time went on, the last question became more and more important to us. The comparison with the homeostat reinforced our interest in the perceptron as a good object for study in the mathematical theory of computation rather than as a practical machine, and we became interested in such questions as:

Could we see the perceptron convergence theorem as a manifestation of a finite state situation?

How is it related to hill climbing and other search techniques?

Does it differ fundamentally from other methods of solving linear inequalities?

We rather complacently thought that the first question would have an easy answer until our student, Terry Beyer, drew our attention to some difficulties and soundly conjectured that the way out would be to prove something like what eventually became Theorem 11.6. A concerted effort to settle the problem with the help of Dona Strauss led to an interesting but unpublishable proof. Soon after this we heard that Bradley Efron had proved a similar theorem and found that by borrowing an idea from him we could generate the demonstration given in §11.6.
Efron, who did not publish his proof, credits Nils Nilsson with the conjecture that led him to the theorem.

We now see the perceptron learning theorem as a simple example of a larger problem about memory (or information storage and retrieval), much as the nonlearning perceptron is a paradigm in the theory of parallel computation. Chapter 12 can be regarded as a manifesto on the importance of this problem.

13.7 Computational Geometry

We like to think that the perceptron illustrates the possibility of a more organic interaction between traditional mathematical topics and ideas of computation. When we first talked about $\psi_{\text{CONNECTED}}$ we thought we were studying an isolated fact about a type of computer device. By the time we had conjectured and started to prove that the Eulerian predicates were the only topological ones of finite order, we felt we were studying geometry.

This might represent a bad tendency of people trained in classical mathematics to drag the new subject of computation back into their familiar territory. Or it could prefigure a future of computational thought not just as a new and separate autonomous science but as a way of thinking that will permeate all the other sciences. The truth must lie between these extremes. In any case, we are excited to see around us at M.I.T. a steady growth of research whose success is a confirmation of the value of the concept of “computational geometry” as well as of the talent of our colleagues and students.

Manuel Blum and Carl Hewitt obtained the first new result by studying the geometric ability of finite-state machines. More recently, Blum, Bill Henneman, and Harriet Fell have found interesting properties of the relations induced on Euclidean figures by the imposition of a discrete grid. Terry Beyer has discovered very surprising algorithms for geometric computation in iterative arrays of automata.
Needless to say, these people have contributed to the work described in this book as much by their indirect influence on the intellectual atmosphere around us as by many pieces of advice, comment, and criticism of which only a small proportion is reflected in direct acknowledgments in the text. Many points of mathematical technique and exposition were improved by suggestions from W. Armstrong, R. Beard, L. Lyons, M. Paterson, and G. Sussman.

13.8 Other People

In addition to those already mentioned, we owe much to W. W. Bledsoe, Dona Strauss, O. G. Selfridge, R. J. Solomonoff, and R. M. Fano.

13.9 Other Places

For political and heuristic reasons, we mention that most of the new ideas came in new environments: beaches, swamps, and mountains.

13.10 Institutions

Even if we had not been supported by the Advanced Research Projects Agency we would have liked to express the debt owed by computer science to its Information Sciences branch and to the imaginative band of men who built it: J. R. Licklider, I. E. Sutherland, R. W. Taylor, L. G. Roberts. In fact, ARPA has supported most of our work, through M.I.T.’s Project MAC and the Office of Naval Research.

Epilogue: The New Connectionism

When perceptron-like machines came on the scene, we found that in order to understand their capabilities we needed some new ideas. It was not enough simply to examine the machines themselves or the procedures used to make them learn. Instead, we had to find new ways to understand the problems they would be asked to solve. This is why our book turned out to be concerned less with perceptrons per se than with concepts that could help us see the relation between patterns and the types of parallel-machine architectures that might or might not be able to recognize them.

Why was it so important to develop theories about parallel machines?
One reason was that the emergence of serial computers quickly led to a very respectable body of useful ideas about algorithms and algorithmic languages, many of them based on a half-century’s previous theories about logic and effective computability. But similarly powerful ideas about parallel computation did not develop nearly so rapidly, partly because massively parallel hardware did not become available until much later and partly because much less knowledge that might be relevant had been accumulated in the mathematical past. Today, however, it is feasible either to simulate or to actually assemble huge and complex arrangements of interacting elements. Consequently, theories about parallel computation have now become of immediate and intense concern to workers in physics, engineering, management, and many other disciplines, and especially to workers involved with brain science, psychology, and artificial intelligence.

Perhaps this is why the past few years have seen new and heated discussions of network machines as part of an intellectually aggressive movement to establish a paradigm for artificial intelligence and cognitive modeling. Indeed, this growth of activity and interest has been so swift that people talk about a “connectionist revolution.” The purpose of this epilogue, added in 1988, is to help present-day students to use the ideas presented in Perceptrons to put the new results into perspective and to formulate more clearly the research questions suggested by them. To do this succinctly, we adopt the strategy of focusing on one particular example of modern connectionist writing. Recently, David Rumelhart, James McClelland, and fourteen collaborators published a two-volume work that has become something of a connectionist manifesto: Parallel Distributed Processing (MIT Press, 1986). We shall take this work (henceforth referred to as PDP) as our connectionist text.
What we say about this particular text will not, of course, apply literally to other writings on this subject, but thoughtful readers will seize the general point through the particular case. In most of this epilogue we shall discuss the examples in PDP from inside the connectionist perspective, in order to flag certain problems that we do not expect to be solvable within the framework of any single, homogeneous machine. At the end, however, we shall consider the same problems from the perspective of the overview we call “society of mind,” a conceptual framework that makes it much more feasible to exploit collections of specialized accomplishments.

PDP describes Perceptrons as pessimistic about the prospects for connectionist machinery:

“. . . even though multilayer linear threshold networks are potentially much more powerful . . . it was the limitations on what perceptrons could possibly learn that led to Minsky and Papert’s (1969) pessimistic evaluation of the perceptron. Unfortunately, that evaluation has incorrectly tainted more interesting and powerful networks of linear threshold and other nonlinear units. As we shall see, the limitations of the one-step perceptrons in no way apply to the more complex networks.” (vol. 1, p. 65)

We scarcely recognize ourselves in this description, and we recommend rereading the remarks in section 0.3 about romanticism and rigor. We reiterate our belief that the romantic claims have been less wrong than the pompous criticisms. But we also reiterate that the discipline can grow only when it makes a parallel effort to critically evaluate its apparent accomplishments. Our own work in Perceptrons is based on the interaction between an enthusiastic pursuit of models of new phenomena and a rigorous search for ways to understand the limitations of these models.
In any case, such citations have given our book the reputation of being mainly concerned with what perceptrons cannot do, and of having concluded with a qualitative evaluation that the subject was not important. Certainly, some chapters prove that various important predicates have perceptron coefficients that grow unmanageably large. But many chapters show that other predicates can be surprisingly tractable. It is no more apt to describe our mathematical theorems as pessimistic than it would be to say the same about deducing the conservation of momentum from the laws of mechanics. Theorems are theorems, and the history of science amply demonstrates how discovering limiting principles can lead to deeper understanding. But this happens only when those principles are taken seriously, so we exhort contemporary connectionist researchers to consider our results seriously as sources of research questions instead of maintaining that they “in no way apply.”

What Perceptrons Can’t Do

To put our results into perspective, let us recall the situation in the early 1960s: Many people were impressed by the fact that initially unstructured networks composed of very simple devices could be made to perform many interesting tasks, by processes that could be seen as remarkably like some forms of learning. A different fact seemed to have impressed only a few people: While those networks did well on certain tasks and failed on certain other tasks, there was no theory to explain what made the difference, particularly when they seemed to work well on small (“toy”) problems but broke down with larger problems of the same kind. Our goal was to develop analytic tools to give us better ideas about what made the difference. But finding a comprehensive theory of parallel computation seemed infeasible, because the subject was simply too general.
What we had to do was sharpen our ideas by working with some subclass of parallel machines that would be sufficiently powerful to perform significant computations, that would also share at least some of the features that made such networks attractive to those who sought a deeper understanding of the brain, and that would also be mathematically simple enough to permit theoretical analysis. This is why we used the abstract definition of perceptron given in this book. The perceptron seemed powerful enough in function, suggestive enough in architecture, and simple enough in its mathematical definition, yet understanding the range and character of its capabilities presented challenging puzzles.

Our prime example of such a puzzle was the recognition of connectedness. It took us many months of work to capture in a formal proof our strong intuition that perceptrons were unable to represent that predicate. Perhaps the most instructive aspect of that whole process was that we were guided by a flawed intuition to the proof that perceptrons cannot recognize the connectivity in any general or practical sense. We had assumed that perceptrons could not even detect the connectivity of hole-free blobs, because, as we supposed, no local forms of evidence like those in figure 5.7 could correlate with the correct decision. Yet, as we saw in subsection 5.8.1, if a figure is known to have no holes, then a low-order perceptron can decide on its connectivity; this we had not initially believed to be possible. It is hard to imagine better evidence to show how artificial it is to separate “negative” from “positive” results in this kind of investigation. To explain how this experience affected us, we must abstract what we learned from it.
First we learned to reformulate questions like “Can perceptrons perform a certain task?” Strictly speaking, it is misleading to say that perceptrons cannot recognize connectedness, since for any particular size of retina we can make a perceptron that will recognize any predicate by providing it with enough φs of sufficiently high order. What we did show was that the general predicate requires perceptrons of unbounded order. More generally, we learned to replace globally qualitative questions about what perceptrons cannot do with questions in the spirit of what is now called computational complexity. Many of our results are of the form $M = f(R)$, where R is a measure of the size of the problem and M is the magnitude of some parameter of a perceptron (such as the order of its predicates, how many of them might be required, the information content of the coefficients, or the number of cycles needed for learning to converge). The study of such relationships gave us a better sense of what is likely to go wrong when one tries to enlarge the scale of a perceptron-like computation. In serial computing it was already well known that certain algorithms depending on search processes would require numbers of steps of computation that increased exponentially with the size of the problem. Much less was known about such matters in the case of parallel machines.

The second lesson was that in order to understand what perceptrons can do we would have to develop some theories of “problem domains” and not simply a “theory of perceptrons.” In previous work on networks, from McCulloch and Pitts to Rosenblatt, even the best theorists had tried to formulate general-purpose theories about the kinds of networks they were interested in. Rosenblatt’s convergence theorem is an example of how such investigations can lead to powerful results.
But something qualitatively different was needed to explain why perceptrons could recognize the connectedness of hole-free figures yet be unable to recognize connectedness in general. For this we needed a bridge between a theory about the computing device and a theory about the content of the computation. The reason why our group-invariance theorem was so useful here was that it had one foot on the geometric side and one on the computational side.

Our study of the perceptron was an attempt to understand general principles through the study of a special case. Even today, we still know very little, in general, about how the costs of parallel computation are affected by increases in the scale of problems. Only the cases we understand can serve as bases for conjectures about what will happen in other situations. Thus, until there is evidence to the contrary, we are inclined to project the significance of our results to other networks related to perceptrons. In the past few years, many experiments have demonstrated that various new types of learning machines, composed of multiple layers of perceptron-like elements, can be made to solve many kinds of small-scale problems. Some of those experimenters believe that these performances can be economically extended to larger problems without encountering the limitations we have shown to apply to single-layer perceptrons. Shortly, we shall take a closer look at some of those results and see that much of what we learned about simple perceptrons will still remain quite pertinent.

It certainly is true that most of the theorems in this book are explicitly about machines with a single layer of adjustable connection weights. But this does not imply (as many modern connectionists assume) that our conclusions don’t apply to multilayered machines. To be sure, those proofs no longer apply unchanged, because their antecedent conditions have changed. But the phenomena they describe will often still persist.
One must examine them, case by case. For example, all our conclusions about order-limited predicates (see section 0.7) continue to apply to networks with multiple layers, because the order of any unit in a given layer is bounded by the product of the orders of the units in earlier layers. Since many of our arguments about order constrain the representations of group-invariant predicates, we suspect that many of those conclusions, too, will apply to multilayer nets. For example, multilayer networks will be no more able to recognize connectedness than are perceptrons. (This is not to say that multilayer networks do not have advantages. For example, the product rule can yield logarithmic reductions in the orders and numbers of units required to compute certain high-order predicates. Furthermore, units that are arranged in loops can be of effectively unbounded order; hence, some such networks will be able to recognize connectedness by using internal serial processing.)

Thus, in some cases our conclusions will remain provably true and in some cases they will be clearly false. In the middle there are many results that we still think may hold, but we do not know any formal proofs. In the next section we shall show how some of the experiments reported in PDP lend credence to some such conjectures.

Recognizing Symmetry

In this section we contrast two different networks, both of which recognize symmetrical patterns defined on a six-point linear retina. To be precise, we would like to recognize the predicate X is symmetric about the midpoint of R. Figure 1 shows a simple way to represent this as a perceptron that uses R φ units, each of order 2. Each one of them will locally detect a deviation from symmetry at two particular retinal points.

[Figure 1: Symmetry using order-2 disjunction.]

[Figure 2: Symmetry using order-R stratification. Actual coefficients from PDP experiment.]
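The order-2 representation of symmetry described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the choice of two one-sided mismatch detectors per mirrored pair (which gives R units on a six-point retina) is an assumption about how such units could be laid out, not a transcription of figure 1.

```python
from itertools import product

def deviation_units(x):
    """Two order-2 units per mirrored pair; each fires on one kind of mismatch."""
    n = len(x)
    units = []
    for i in range(n // 2):
        a, b = x[i], x[n - 1 - i]
        units.append(int(a == 1 and b == 0))  # left point on, mirror point off
        units.append(int(a == 0 and b == 1))  # left point off, mirror point on
    return units

def symmetric(x):
    """Linear threshold over the units: output 1 iff no unit reports a deviation."""
    total = sum(-1 * u for u in deviation_units(x))
    return int(total > -1)

R = 6  # retina size used in the text's example
patterns = list(product([0, 1], repeat=R))
agree = all(symmetric(p) == int(p == p[::-1]) for p in patterns)
n_units = len(deviation_units(patterns[0]))
n_symmetric = sum(symmetric(p) for p in patterns)
print(agree, n_units, n_symmetric)  # prints: True 6 8
```

Every unit looks at only two retinal points and every coefficient has magnitude 1, which is the sense in which this representation is local and cheap.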
Figure 2 shows the results of an experiment from PDP. It depicts a network that represents $\psi_{\mathrm{SYMMETRY}}$ in quite a different way. Amazingly, this network uses only two φ functions (albeit ones of order R). The weights displayed in figure 2 were produced by a learning procedure that we shall describe shortly. For the moment, we want to focus not on the learning problem but on the character of the coefficients. We share the sense of excitement the PDP experimenters must have experienced as their machine converged to this strange solution, in which this predicate seems to be portrayed as having a more holistic character than would be suggested by its conjunctively local representation. However, one must ask certain questions before celebrating this as a significant discovery. In PDP it is recognized that the lower-level coefficients appear to be growing exponentially, yet no alarm is expressed about this. In fact, anyone who reads section 7.3 should recognize such a network as employing precisely the type of computational structure that we called stratification. Also, in the case of network 2, the learning procedure required 1,208 cycles through each of the 64 possible examples, a total of 77,312 trials (enough to make us wonder if the time for this procedure to determine suitable coefficients increases exponentially with the size of the retina). PDP does not address this question. What happens when the retina has 100 elements? If such a network required on the order of $2^{200}$ trials to learn, most observers would lose interest.

This observation shows most starkly how we and the authors of PDP differ in interpreting the implications of our theory. Our “pessimistic evaluation of the perceptron” was the assertion that, although certain problems can easily be solved by perceptrons on small scales, the computational costs become prohibitive when the problem is scaled up.

[Figure 3: Parity using Gamba masks.]
The authors of PDP seem not to recognize that the coefficients of this symmetry machine confirm that thesis, and celebrate this performance on a toy problem as a success rather than asking whether it could become a profoundly “bad” form of behavior when scaled up to problems of larger size.

Both of these networks are in the class of what we called Gamba perceptrons in section 13.1, that is, ordinary perceptrons whose φ functions are themselves perceptrons of order 1. Accordingly, we are uncomfortable about the remark in PDP that “multilayer linear threshold networks are potentially much more powerful than single-layer perceptrons.” Of course they are, in various ways, and chapter 8 of PDP describes several studies of multilayer perceptron-like devices. However, most of them, like figure 2 above, still belong to the class of networks discussed in Perceptrons.

Also in chapter 8 of PDP, similar methods are applied to the problem of recognizing parity, and the very construction described in our section 13.1, through which a Gamba perceptron can recognize parity, is rediscovered. Figure 3 here shows the results. To learn these coefficients, the procedure described in PDP required 2,825 cycles through the 16 possible input patterns, thus consuming 45,200 trials for the network to learn to compute the parity predicate for only four inputs. Is this a good result or a bad result? We cannot tell without more knowledge about why the procedure requires so many trials. Until one has some theory of that, there is no way to assess the significance of any such experimental result; all one can say is that 45,200 = 45,200. In section 10.1 we saw that if a perceptron’s φ functions include only masks, the parity predicate requires doubly exponential coefficients. If we were sure that that was happening, this would suggest to us that we should represent 45,200 (approximately) as $2^{2^4}$ rather than, say, as $2^{16}$.
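To make the parity discussion concrete, here is a minimal sketch of one well-known way a Gamba-style network can compute parity: order-1 threshold units that count how many inputs are on, combined with alternating weights. The names `counting_units` and `gamba_parity` are hypothetical, and no claim is made that these weights match the learned coefficients in figure 3.

```python
from itertools import product

def counting_units(x):
    """Order-1 threshold units: unit k fires when at least k inputs are on."""
    s = sum(x)
    return [int(s >= k) for k in range(1, len(x) + 1)]

def gamba_parity(x):
    """Alternating +1, -1, +1, ... weights over the counting units, then a threshold."""
    units = counting_units(x)
    total = sum(((-1) ** k) * u for k, u in enumerate(units))
    return int(total > 0)  # total is 1 for an odd count, 0 for an even count

n = 4
patterns = list(product([0, 1], repeat=n))
ok = all(gamba_parity(p) == sum(p) % 2 for p in patterns)
trials = 2825 * len(patterns)  # cycles reported in the text times the 16 patterns
print(ok, len(patterns), trials)  # prints: True 16 45200
```

In this construction the coefficients stay at magnitude 1 because the hidden units already summarize the input count; the work has been moved into the choice of units.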
However, here we suspect that this would be wrong, because the input units aren’t masks but predicates (apparently provided from the start) that already know how to “count.” These make the problem much easier. In any case, the lesson of Perceptrons is that one cannot interpret the meaning of such an experimental report without first making further probes.

Learning

We haven’t yet said how those networks learned. The authors of PDP describe a learning procedure called the “Generalized Delta Rule” (we’ll call it GD) as a new breakthrough in connectionist research. To explain its importance, they depict as follows the theoretical situation they inherited:

“A further argument advanced by Minsky and Papert against perceptron-like models with hidden units is that there was no indication how such multilayer networks were to be trained. One of the appealing features of the one-layer perceptron is the existence of a powerful learning procedure, the perceptron convergence procedure of Rosenblatt. In Minsky and Papert’s day, there was no such powerful learning procedure for the more complex multilayer systems. This is no longer true. . . . The GD procedure provides a direct generalization of the perceptron learning procedure which can be applied to arbitrary networks with multiple layers and feedback among layers. This procedure can, in principle, learn arbitrary functions including, of course, parity and connectedness.” (vol. 1, p. 113)

In Minsky and Papert’s day, indeed! In this section we shall explain why, although the GD learning procedure embodies some useful ideas, it does not justify such sweeping claims.
But in order to explain why, and to see how the approach in the current wave of connectionism differs from that in Perceptrons, we must first examine with some care the relationship between two branches of perceptron theory which could be called “theory of learning” and “theory of representation.” To begin with, one might paraphrase the above quotation as saying that, until recently, connectionism had been paralyzed by the following dilemma:

Perceptrons could learn anything that they could represent, but they were too limited in what they could represent.
Multilayered networks were less limited in what they could represent, but they had no reliable learning procedure.

According to the classical theory of perceptrons, those limitations on representability depend on such issues as whether a given predicate P can be represented as a perceptron defined by a given set Φ on a given retina, whether P is of finite order, whether P can be realized with coefficients of bounded size, whether properties of several representable predicates are inherited by combinations of those predicates, and so forth. All the results in the first half of our book are involved with these sorts of representational issues. Now, when one speaks about “powerful learning procedures,” the situation is complicated by the fact that, given enough input units of sufficiently high order, even simple perceptrons can represent, and therefore learn, arbitrary functions. Consequently, it makes no sense to speak about “power” in absolute terms. Such statements must refer to relative measures of sizes and scales.

As for learning, the dependability of Rosenblatt’s Perceptron Convergence theorem of section 11.1 (let’s call it PC for short) is very impressive: If it is possible at all to represent a predicate P as a linear threshold function of a given set of predicates Φ, then the PC procedure will eventually discover some particular set of coefficients that actually represents P.
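The PC procedure just described can be sketched as a simple error-correction loop. This is a minimal illustration, not the book's formal statement: the feature set `phi` (a constant unit plus one order-1 unit per point), the majority predicate used as the target, and the cycle limit are all assumptions chosen so that a linear threshold representation is known to exist.

```python
from itertools import product

def phi(x):
    """Fixed feature set: a constant unit plus one order-1 unit per retinal point."""
    return (1,) + tuple(x)

def predict(A, x):
    return int(sum(a * v for a, v in zip(A, phi(x))) > 0)

def pc_train(examples, max_cycles=1000):
    """Error correction: add phi(x) on a missed positive, subtract it on a false positive."""
    A = [0] * len(phi(examples[0][0]))
    for cycle in range(1, max_cycles + 1):
        errors = 0
        for x, target in examples:
            if predict(A, x) != target:
                sign = 1 if target == 1 else -1
                A = [a + sign * v for a, v in zip(A, phi(x))]
                errors += 1
        if errors == 0:          # a full cycle with no errors: A represents the predicate
            return A, cycle
    return None, max_cycles

# Illustrative target: "at least two of three points are on" (linearly separable).
majority = [(x, int(sum(x) >= 2)) for x in product([0, 1], repeat=3)]
A, cycles = pc_train(majority)
print(A, cycles)
```

The loop stops only when a whole pass produces no errors, so a returned coefficient vector is guaranteed to represent the predicate on the given examples; the theorem is what guarantees the loop terminates when such a vector exists.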
However, this is not, in itself, a sufficient reason to consider PC interesting and important, because that theorem says nothing about the crucial issue of efficiency. PC is not interesting merely because it provides a systematic way to find suitable coefficients. One could always take recourse, instead, to simple, brute-force search: given that some solution exists, one could simply search through all possible integer coefficient vectors, in order of increasing magnitude, until no further “errors” occurred. But no one would consider such an exhaustive process to be an interesting foundation for a learning theory.

What, then, makes PC seem significant? That it discovers those coefficients in ways that are intriguing in several other important respects. The PC procedure seems to satisfy many of the intuitive requirements of those who are concerned with modeling what really happens in a biological nervous system. It also appeals to both our engineering aesthetic and our psychological aesthetic by serving simultaneously as both a form of guidance by error correction and a form of hill-climbing. In terms of computational efficiency, PC seems much more efficient than brute-force procedures (although we have no rigorous and general theory of the conditions under which that will be true). Finally, PC is so simple mathematically as to make one wish to believe that it reflects something real.

Hill-Climbing and the Generalized Delta Procedure

Suppose we want to find the maximum value of a given function F(x,y,z, . . .) of n variables. The extreme brute-force solution is to calculate the function for all sets of values for the variables and then select the point for which F had the largest value. The approach we called hill-climbing in section 11.3 is a local procedure designed to attempt to find that global maximum.
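The contrast between the brute-force solution and a local procedure can be sketched on a one-variable example. Everything here is an illustrative assumption: the two-peaked function `F`, the integer grid from 0 to 10, and the rule of moving to the better immediate neighbour.

```python
def F(x):
    """Illustrative landscape: a low peak at x = 2 and a higher peak at x = 8."""
    return max(3 - abs(x - 2), 6 - 2 * abs(x - 8))

def brute_force_max(F, lo, hi):
    """Evaluate F at every grid point and keep the best one."""
    return max(range(lo, hi + 1), key=F)

def hill_climb(F, start, lo, hi):
    """Local procedure: step to the better neighbour until neither neighbour is higher."""
    x = start
    while True:
        neighbours = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = max(neighbours, key=F)
        if F(best) <= F(x):
            return x
        x = best

print(brute_force_max(F, 0, 10))   # prints: 8
print(hill_climb(F, 0, 0, 10))     # prints: 2
print(hill_climb(F, 6, 0, 10))     # prints: 8
```

The exhaustive search always finds the higher peak but inspects every point; the local procedure inspects only a few points, and where it ends up depends on where it starts.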
To make this subject more concrete, it is useful to think of the two-dimensional case in which the x-y plane is the ground and z = F(x,y) is the elevation of the point (x,y,z) on the surface of a real physical hill. Now, imagine standing on the hill in a fog so dense that only the immediate vicinity is visible. Then the only resort is to use some diameter-limited local process. The best-known such process is the method of “steepest ascent,” discussed in section 11.6: First determine the slope of the surface in various directions from the point where you are standing, then choose the direction that most rapidly increases your altitude and take a step of a certain size in that direction. The hope is that, by thus climbing the slope, you will eventually reach the highest point.

It is both well known and obvious that hill-climbing does not always work. The simplest way to fail is to get stuck on a local maximum, an isolated peak whose altitude is relatively insignificant. There simply is no local way for a hill-climbing procedure to be sure that it has reached a global maximum rather than some local feature of topography (such as a peak, a ridge, or a plain) on which it may get trapped. We showed in section 11.6 that PC is equivalent (in a peculiar sense) to a hill-climbing procedure that works its way to the top of a hill whose geometry can actually be proved not to have any such troublesome local features, provided that there actually exists some perceptron-weight vector solution $A^*$ to the problem. Thus, one could argue that perceptrons “work” on those problems not because of any particular virtue of the perceptrons or of their hill-climbing procedures but because the hills for those soluble problems have clean topographies. What are the prospects of finding a learning procedure that works equally well on all problems, and not merely on those that have linearly separable decision functions?
The authors of PDP maintain that they have indeed discovered one:

“Although our learning results do not guarantee that we can find a solution for all solvable problems, our analyses and results have shown that, as a practical matter, the error propagation scheme leads to solutions in virtually every case. In short, we believe that we have answered Minsky and Papert’s challenge and have found a learning result sufficiently powerful to demonstrate that their pessimism about learning in multilayer machines was misplaced.” (vol. 1, p. 361)

But the experiments in PDP, though interesting and ingenious, do not actually demonstrate any such thing. In fact, the “powerful new learning result” is nothing other than a straightforward hill-climbing algorithm, with all the problems that entails. To see how GD works, assume we are given a network of units interconnected by weighted, unidirectional links. Certain of these units are connected to input terminals, and certain others are regarded as output units. We want to teach this network to respond to each (vector) input pattern $X_p$ with a specified output vector $Y_p$. How can we find a set of weights $W = \{w_{ij}\}$ that will accomplish this? We could try to do it by hill-climbing on the space of $W$s, provided that we could define a suitable measure of relative altitude or “success.” One problem is that there cannot be any standard, universal way to measure errors, because each type of error has different costs in different situations. But let us set that issue aside and do what scientists often do when they can’t think of anything better: sum the squares of the differences. So, if $Y(W, X)$ is the network’s output vector for internal weights $W$ and inputs $X$, define the altitude function $E(W)$ to be this sum:

$$E(W) = -\sum_{\text{all input patterns } p} \left[\, Y_p - Y(W, X_p) \,\right]^2$$

In other words, we compute our measure of success by presenting successively each stimulus $X_p$ to the network.
Then we compute the (vector) difference between the actual output and the desired output. Finally, we add up the squares of the magnitudes of those differences. (The minus sign is simply for thinking of climbing up instead of down.) The error function $E$ will then have a maximum possible value of zero, which will be achieved if and only if the machine performs perfectly. Otherwise there will be at least one error and $E(W)$ will be negative. Then all we have to do is climb the hill $E(W)$ defined over the (high-dimensional) space of weight vectors $W$. If our path reaches a $W$ for which $E(W)$ is zero, our problem will be solved and we will be able to say that our machine has “learned from its experience.”

We’ll use a process that climbs this hill by the method of steepest ascent. We can do this by estimating, at every step, the partial derivatives $\partial E / \partial w_{ij}$ of the total error with respect to each component of the weight vector. This tells us the direction of the gradient vector $\partial E / \partial W$, and we then proceed to move a certain distance in that direction. This is the mathematical character of the Generalized Delta procedure, and it differs in no significant way from older forms of diameter-limited gradient followers.

Before such a procedure can be employed, there is an obstacle to overcome. One cannot directly apply the method of gradient ascent to networks that contain threshold units. This is because the derivative of a step-function is zero, whenever it exists, and hence no gradient is defined. To get around this, PDP applies a smoothing function to make those threshold functions differentiable. The trick is to replace the threshold function for each unit with a monotonic and differentiable function of the sum of that unit’s inputs. This permits the output of each unit to encode information about the sum of its inputs while still retaining an approximation to the perceptron’s decision-making ability. Then gradient ascent becomes more feasible.
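The steps above (an altitude function built from summed squared differences, a smoothed threshold, and steepest ascent on estimated partial derivatives) can be sketched for a single unit. This is a sketch under assumptions, not the PDP implementation: the logistic function as the smoothing function, the OR predicate as the target, the finite-difference gradient estimate, and the step size and iteration count are all illustrative choices.

```python
import math

def sigmoid(s):
    """Assumed smoothing: a monotonic, differentiable stand-in for the threshold."""
    return 1.0 / (1.0 + math.exp(-s))

def output(W, x):
    """One smoothed unit: W[0] is a bias, W[1:] weight the inputs."""
    return sigmoid(W[0] + sum(w * xi for w, xi in zip(W[1:], x)))

def E(W, patterns):
    """Altitude: minus the sum of squared output errors (zero is the summit)."""
    return -sum((y - output(W, x)) ** 2 for x, y in patterns)

def estimate_gradient(W, patterns, h=1e-5):
    """Estimate each partial derivative of E by a central difference."""
    grad = []
    for i in range(len(W)):
        up = W[:i] + [W[i] + h] + W[i + 1:]
        down = W[:i] + [W[i] - h] + W[i + 1:]
        grad.append((E(up, patterns) - E(down, patterns)) / (2 * h))
    return grad

def steepest_ascent(W, patterns, step=0.5, iterations=2000):
    """Repeatedly move a fixed multiple of the gradient uphill."""
    for _ in range(iterations):
        g = estimate_gradient(W, patterns)
        W = [w + step * gi for w, gi in zip(W, g)]
    return W

OR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
W0 = [0.0, 0.0, 0.0]
W1 = steepest_ascent(W0, OR)
print(E(W0, OR), E(W1, OR))
```

With smoothed units the altitude never reaches exactly zero; it only approaches it as the weights grow, so "learned" here means that every output falls on the correct side of one half.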
However, we suspect that this smoothing trick may entail a large (and avoidable) cost when the predicate to be learned is actually a composition of linear threshold functions. There ought to be a more efficient alternative based on how much each weight must be changed, for each stimulus, to make the local input sum cross the threshold.

In what sense is the particular hill-climbing procedure GD more powerful than the perceptron’s PC? Certainly GD can be applied to more networks than PC can, because PC can operate only on the connections between one layer of φ units and a single output unit. GD, however, can modify the weights in an arbitrary multilayered network, including nets containing loops. Thus, in contrast to the perceptron (which is equipped with some fixed set of φs that can never be changed), GD can be regarded as able to change the weights inside the φs. Thus GD promises, in effect, to be able to discover useful new φ functions, and many of the experiments reported in PDP demonstrate that this often works.

A natural way to estimate the gradient of $E(W)$ is to estimate $\partial E / \partial w_{ij}$ by running through the entire set of inputs for each weight. However, for large networks and large problems that could be a horrendous computation. Fortunately, in a highly connected network, all those many components of the gradient are not independent of one another, but are constrained by the algebraic “chain rule” for the derivatives of composite functions. One can exploit those constraints to reduce the amount of computation by applying the chain-rule formula, recursively, to the mathematical description of the network. This recursive computation is called “back-propagation” in PDP. It can substantially reduce the amount of calculation for each hill-climbing step in networks with many connections.
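The point that the recursive chain-rule computation yields the same gradient as the weight-by-weight estimate can be checked on a tiny network. This is a sketch under assumptions: a 2-2-1 network of logistic units, the XOR patterns, and arbitrary starting weights are all illustrative, and the function names are hypothetical.

```python
import math

def sig(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(W, x):
    """W = (hidden, out): hidden holds two [bias, w1, w2] units; out is [bias, v1, v2]."""
    hidden, out = W
    h = [sig(u[0] + u[1] * x[0] + u[2] * x[1]) for u in hidden]
    y = sig(out[0] + out[1] * h[0] + out[2] * h[1])
    return h, y

def E(W, patterns):
    return -sum((t - forward(W, x)[1]) ** 2 for x, t in patterns)

def backprop_gradient(W, patterns):
    """Chain rule applied recursively: one forward and one backward pass per pattern."""
    hidden, out = W
    g_hidden = [[0.0] * 3 for _ in hidden]
    g_out = [0.0] * 3
    for x, t in patterns:
        h, y = forward(W, x)
        d_out = 2 * (t - y) * y * (1 - y)            # dE/d(input sum of the output unit)
        for j, v in enumerate([1.0] + h):
            g_out[j] += d_out * v
        for k in range(2):
            d_hid = d_out * out[k + 1] * h[k] * (1 - h[k])   # propagated one layer back
            for j, v in enumerate([1.0, x[0], x[1]]):
                g_hidden[k][j] += d_hid * v
    return g_hidden, g_out

def numeric_gradient(W, patterns, eps=1e-6):
    """Weight-by-weight estimate: two full evaluations of E for every weight."""
    hidden, out = W
    def bumped(layer, k, j, delta):
        hid, o = [u[:] for u in hidden], out[:]
        if layer == "hidden":
            hid[k][j] += delta
        else:
            o[j] += delta
        return E((hid, o), patterns)
    g_hidden = [[(bumped("hidden", k, j, eps) - bumped("hidden", k, j, -eps)) / (2 * eps)
                 for j in range(3)] for k in range(2)]
    g_out = [(bumped("out", 0, j, eps) - bumped("out", 0, j, -eps)) / (2 * eps)
             for j in range(3)]
    return g_hidden, g_out

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
W = ([[0.1, 0.4, -0.3], [-0.2, 0.5, 0.7]], [0.05, -0.6, 0.8])
bp, num = backprop_gradient(W, XOR), numeric_gradient(W, XOR)
max_diff = max(abs(a - b) for a, b in zip(sum(bp[0], []) + bp[1], sum(num[0], []) + num[1]))
print(max_diff)
```

The two gradients agree to within numerical error, which illustrates that the chain-rule pass changes how cheaply the uphill direction is computed, not which direction is taken.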
We have the impression that many people in the connectionist community do not understand that this is merely a particular way to compute a gradient and have assumed instead that back-propagation is a new learning scheme that somehow gets around the basic limitations of hill-climbing. Clearly GD would be far more valuable than PC if it could be made to be both efficient and dependable. But virtually nothing has been proved about the range of problems upon which GD works both efficiently and dependably. Indeed, GD can fail to find a solution when one exists, so in that narrow sense it could be considered less powerful than PC. In the early years of cybernetics, everyone understood that hill-climbing was always available for working easy problems, but that it almost always became impractical for problems of larger sizes and complexities. We were very pleased to discover (see section 11.6) that PC could be represented as hill-climbing; however, that very fact led us to wonder whether such procedures could dependably be generalized, even to the limited class of multilayer machines that we named Gamba perceptrons. The situation seems not to have changed much — we have seen no contemporary connectionist publication that casts much new theoretical light on the situation. Then why has GD become so popular in recent years? In part this is because it is so widely applicable, and because it does indeed yield new results (at least on problems of rather small scale). Its reputation also gains, we think, from its being presented in forms that share, albeit to a lesser degree, the biological plausibility of PC. But we fear that its reputation also stems from unfamiliarity with the manner in which hill-climbing methods deteriorate when confronted with larger-scale problems.
In any case, little good can come from statements like “as a practical matter, GD leads to solutions in virtually every case” or “GD can, in principle, learn arbitrary functions.” Such pronouncements are not merely technically wrong; more significantly, the pretense that problems do not exist can deflect us from valuable insights that could come from examining things more carefully. As the field of connectionism becomes more mature, the quest for a general solution to all learning problems will evolve into an understanding of which types of learning processes are likely to work on which classes of problems. And this means that, past a certain point, we won’t be able to get by with vacuous generalities about hill-climbing. We will really need to know a great deal more about the nature of those surfaces for each specific realm of problems that we want to solve. On the positive side, we applaud those who bravely and romantically are empirically applying hill-climbing methods to many new domains for the first time, and we expect such work to result in important advances. Certainly these researchers are exploring networks with architectures far more complex than those of perceptrons, and some of their experiments already have shown indications of new phenomena that are well worth trying to understand.

Scaling Problems Up in Size

Experiments with toy-scale problems have proved as fruitful in artificial intelligence as in other areas of science and engineering. Many techniques and principles that ultimately found real applications were discovered and honed in microworlds small enough to comprehend yet rich enough to challenge our thinking. But not every phenomenon encountered in dealing with small models can be usefully scaled up. Looking at the relative thickness of the legs of an ant and an elephant reminds us that physical structures do not always scale linearly: an ant magnified a thousand times would collapse under its own weight.
Much of the theory of computational complexity is concerned with questions of scale. If it takes 100 steps to solve a certain kind of equation with four terms, how many steps will it take to solve the same kind of equation with eight terms? Only 200, if the problem scales linearly. But for other problems it will take not twice 100 but 100 squared. For example, the Gamba perceptron of figure 2 needs only two φ functions rather than the six required in figure 1. In neither of these two toy-sized networks does the number seem alarmingly large. One network has fewer units; the other has smaller coefficients. But when we examine how those numbers grow with retinas of increasing size, we discover that whereas the coefficients of figure 1 remain constant, those of figure 2 grow exponentially. And, presumably, a similar price must be paid again in the number of repetitions required in order to learn. In the examination of theories of learning and problem solving, the study of such growths in cost is not merely one more aspect to be taken into account; in a sense, it is the only aspect worth considering. This is because so many problems can be solved “in principle” by exhaustive search through a suitable space of states. Of course, the trouble with that in practice is that there is usually an exponential increase in the number of steps required for an exhaustive search when the scale of the problem is enlarged. Consequently, solving toy problems by methods related to exhaustive search rarely leads to practical solutions to larger problems. For example, though it is easy to make an exhaustive-search machine that never loses a game of noughts and crosses, it is infeasible to do the same for chess. We do not know if this fact is significant, but many of the small examples described in PDP could have been solved as quickly by means of exhaustive search — that is, by systematically assigning and testing all combinations of small integer weights.
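The remark about exhaustive search over small integer weights can be made concrete. The following is a sketch under assumptions, not a reconstruction of any PDP experiment: the network shape (two threshold units feeding one output unit), the weight range {-1, 0, 1}, and the names `network` and `exhaustive_search` are all chosen for illustration.

```python
from itertools import product

XOR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def step(s):
    """Hard threshold unit: fires only when its input sum is positive."""
    return 1 if s > 0 else 0


def network(weights, x):
    """Two hidden threshold units feeding one output threshold unit (9 weights)."""
    a1, a2, a0, b1, b2, b0, c1, c2, c0 = weights
    h1 = step(a1 * x[0] + a2 * x[1] + a0)
    h2 = step(b1 * x[0] + b2 * x[1] + b0)
    return step(c1 * h1 + c2 * h2 + c0)


def exhaustive_search(cases, values=(-1, 0, 1), n_weights=9):
    """Try every assignment of small integer weights until one fits all cases."""
    tried = 0
    for weights in product(values, repeat=n_weights):
        tried += 1
        if all(network(weights, x) == target for x, target in cases):
            return weights, tried
    return None, tried


solution, tried = exhaustive_search(XOR_CASES)
print(solution, tried, "of at most", 3 ** 9)
```

The search space holds `len(values) ** n_weights` candidates, so each added weight multiplies the worst-case work by a constant factor. That exponential growth is why a method that succeeds at this size says little about larger problems.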
When we started our research on perceptrons, we had seen many interesting demonstrations of perceptrons solving problems of very small scale but not doing so well when those problems were scaled up. We wondered what was going wrong. Our first “handle” on how to think about scaling came with the concept of the order of a predicate. If a problem is of order N, then the number of φs for the corresponding perceptron need not increase any faster than as the Nth power of R. Then, whenever we could show that a given problem was of low order, we usually could demonstrate that perceptron-like networks could do surprisingly well on that problem. On the other hand, once we developed the more difficult techniques for showing that certain other problems have unbounded order, this raised alarming warning flags about extending their solutions to larger domains. Unbounded order was not the only source of scaling failures. Another source — one we had not anticipated until the later stages of our work — involved the size, or rather the information content, of the coefficients. The information stored in connectionist systems is embodied in the strengths of weights of the connections between units. The idea that learning can take place by changing such strengths has a ring of biological plausibility, but that plausibility fades away if those strengths are to be represented by numbers that must be accurate to ten or twenty decimal orders of significance.

The Problem of Sampling Variance

Our description of the Generalized Delta Rule assumes that it is feasible to compute the new value of E(W) at every step of the climb. The processes discussed in chapter 8 of PDP typically require only on the order of 100,000 iterations, a range that is easily accessible to computers (but that might in some cases strain our sense of biological plausibility). However, it will not be practical, with larger problems, to cycle through all possible input patterns.
This means that when precise measures of E(W) are unavailable, we will be forced to act, instead, on the basis of incomplete samples — for example, by making a small hill-climbing step after each reaction to a stimulus. (See the discussion of complete versus incremental methods in subsection 12.1.1.) When we can no longer compute dE/dW precisely but can only estimate its components, then the actual derivative will be masked by a certain amount of sampling noise. The text of PDP argues that using sufficiently small steps can force the resulting trajectory to come arbitrarily close to that which would result from knowing dE/dW precisely. When we tried to prove this, we were led to suspect that the choice of step size may depend so much on the higher derivatives of the smoothing functions that large-scale problems could require too many steps for such methods to be practical. So far as we could tell, every experiment described in chapter 8 of PDP involved making a complete cycle through all possible input situations before making any change in weights. Whenever this is feasible, it completely eliminates sampling noise — and then even the most minute correlations can become reliably detectable, because the variance is zero. But no person or animal ever faces situations that are so simple and arranged in so orderly a manner as to provide such cycles of teaching examples. Moving from small to large problems will often demand this transition from exhaustive to statistical sampling, and we suspect that in many realistic situations the resulting sampling noise would mask the signal completely. We suspect that many who read the connectionist literature are not aware of this phenomenon, which dims some of the prospects of successfully applying certain learning procedures to large-scale problems.

Problems of Scaling

In principle, connectionist networks offer all the potential of universal computing devices.
However, our examples of order and coefficient size suggest that various kinds of scaling problems are likely to become obstacles to attempts to exploit that potential. Fortunately, our analysis of perceptrons does not suggest that connectionist networks need always encounter these obstacles. Indeed, our book is rich in surprising examples of tasks that simple perceptrons can perform using relatively low-order units and small coefficients. However, our analysis does show that parallel networks are, in general, subject to serious scaling phenomena. Consequently, researchers who propose such models must show that, in their context, those phenomena do not occur. The authors of PDP seem disinclined to face such problems. They seem content to argue that, although we showed that single-layer networks cannot solve certain problems, we did not know that there could exist a powerful learning procedure for multilayer networks — to which our theorems no longer apply. However, strictly speaking, it is wrong to formulate our findings in terms of what perceptrons can and cannot do. As we pointed out above, perceptrons of sufficiently large order can represent any finite predicate. A better description of what we did is that, in certain cases, we established the computational costs of what perceptrons can do as a function of increasing problem size. The authors of PDP show little concern for such issues, and usually seem content with experiments in which small multilayer networks solve particular instances of small problems. What should one conclude from such examples? A person who thinks in terms of can versus can't will be tempted to suppose that if toy machines can do something, then larger machines may well do it better. One must always probe into the practicality of a proposed learning algorithm.
It is no use to say that “procedure P is capable of learning to recognize pattern X” unless one can show that this can be done in less time and at less cost than with exhaustive search. Thus, as we noted, in the case of symmetry, the authors of PDP actually recognized that the coefficients were growing as powers of 2, yet they did not seem to regard this as suggesting that the experiment worked only because of its very small size. But scientists who exploit the insights gained from studying the single-layer case might draw quite different conclusions. The authors of PDP recognize that GD is a form of hill-climber, but they speak as though becoming trapped on local maxima were rarely a serious problem. In reporting their experiments with learning the XOR predicate, they remark that this occurred “in only two cases ... in hundreds of times.” However, that experiment involved only the toy problem of learning to compute the XOR of two arguments. We conjecture that learning XOR for larger numbers of variables will become increasingly intractable as we increase the numbers of input variables, because by its nature the underlying parity function is absolutely uncorrelated with any function of fewer variables. Therefore, there can exist no useful correlations among the outputs of the lower-order units involved in computing it, and that leads us to suspect that there is little to gain from following whatever paths are indicated by the artificial introduction of smoothing functions that cause partial derivatives to exist. The PDP experimenters encountered a more serious local-maximum problem when trying to make a network learn to add two binary numbers — a problem that contains an embedded XOR problem. When working with certain small networks, the system got stuck reliably. However, the experimenters discovered an interesting way to get around this difficulty by introducing longer chains of intermediate units.
We encourage the reader to study the discussion starting on page 341 of PDP and try to make a more complete theoretical analysis of this problem. We suspect that further study of this case will show that hill-climbing procedures can indeed get multilayer networks to learn to do multidigit addition. However, such a study should be carried out not to show that “networks are good” but to see which network architectures are most suitable for enabling the information required for “carrying” to flow easily from the smaller to the larger digits. In the PDP experiment, the network appears to us to have started on the road toward inventing the technique known to computer engineers as “carry jumping.” To what extent can hill-climbing systems be made to solve hard problems? One might object that this is a wrong question because “hard” is so ill defined. The lesson of Perceptrons is that we must find ways to make such questions meaningful. In the case of hill-climbing, we need to find ways to characterize the types of problems that lead to the various obstacles to climbing hills, instead of ignoring those difficulties or trying to find universal ways to get around them.

The Society of Mind

The preceding section was written as though it ought to be the principal goal of research on network models to determine in which situations it will be feasible to scale their operations up to deal with increasingly complicated problems. But now we propose a somewhat shocking alternative: Perhaps the scale of the toy problem is that on which, in physiological actuality, much of the functioning of intelligence operates. Accepting this thesis leads into a way of thinking very different from that of the connectionist movement. We have used the phrase “society of mind” to refer to the idea that mind is made up of a large number of components, or “agents,” each of which would operate on the scale of what, if taken in isolation, would be little more than a toy problem.
[See Marvin Minsky, The Society of Mind (Simon and Schuster, 1987) and Seymour Papert, Mindstorms (Basic Books, 1982).] To illustrate this idea, let’s try to compare the performance of the symmetry perceptron in PDP with human behavior. An adult human can usually recognize and appreciate the symmetries of a kaleidoscope, and that sort of example leads one to imagine that people do very much better than simple perceptrons. But how much can people actually do? Most people would be hard put to be certain about the symmetry of a large pattern. For example, how long does it take you to decide whether or not the following pattern is symmetrical?

DB4HWUK85HCNZEWJKRKJWEZNCH58KUWH4BD

In many situations, humans clearly show abilities far in excess of what could be learned by simple, uniform networks. But when we take those skills apart, or try to find out how they were learned, we expect to find that they were made by processes that somehow combined the work (already done in the past) of many smaller agencies, none of which, separately, need to work on scales much larger than do those in PDP. Is this hypothesis consistent with the PDP style of connectionism? Yes, insofar as the computations of the nervous system can be represented as the operation of societies of networks. But no, insofar as the mode of operation of those societies of networks (as we imagine them) raises theoretical issues of a different kind. We do not expect procedures such as GD to be able to produce such societies. Something else is needed. What that something must be depends on how we try to extend the range of small connectionist models. We see two principal alternatives. We could extend them either by scaling up small connectionist models or by combining small-scale networks into some larger organization. In the first case, we would expect to encounter theoretical obstacles to maintaining GD’s effectiveness on larger, deeper nets.
And despite the reputed efficacy of other alleged remedies for the deficiencies of hill-climbing, such as “annealing,” we stay with our research conjecture that no such procedures will work very well on large-scale nets, except in the case of problems that turn out to be of low order in some appropriate sense. The second alternative is to employ a variety of smaller networks rather than try to scale up a single one. And if we choose (as we do) to move in that direction, then our focus of concern as theoretical psychologists must turn toward the organizing of small nets into effective large systems. The idea that the lowest levels of thinking and learning may operate on toy-like scales fits many of our common-sense impressions of psychology. For example, in the realm of language, any normal person can parse a great many kinds of sentences, but none of them past a certain bound of involuted complexity. We all fall down on expressions like “the cheese that the rat that the cat that the dog bit chased ate.” In the realm of vision, no one can count great numbers of things, in parallel, at a single glance. Instead, we learn to “estimate.” Indeed, the visual joke in figure 0.1 shows clearly how humans share perceptrons’ inability to easily count and match, and a similar example is embodied in the twin spirals of figure 5.1. The spiral example was intended to emphasize not only that low-order perceptrons cannot perceive connectedness but also that humans have similar limitations. However, a determined person can solve the problem, given enough time, by switching to the use of certain sorts of serial mental processes.

Beyond Perceptrons

No single-method learning scheme can operate efficiently for every possible task; we cannot expect any one type of machine to account for any large portion of human psychology.
For example, in certain situations it is best to carefully accumulate experience; however, when time is limited, it is necessary to make hasty generalizations and act accordingly. No single scheme can do all things. Our human semblance of intelligence emerged from how the brain evolved a multiplicity of ways to deal with different problem realms. We see this as a principle that underlies the mind’s reality, and we interpret the need for many kinds of mechanisms not as a pessimistic and scientifically constraining limitation but as the fundamental source of many of the phenomena that artificial intelligence and psychology have always sought to understand. The power of the brain stems not from any single, fixed, universal principle. Instead it comes from the evolution (in both the individual sense and the Darwinian sense) of a variety of ways to develop new mechanisms and to adapt older ones to perform new functions. Instead of seeking a way to get around that need for diversity, we have come to try to develop “society of mind” theories that will recognize and exploit the idea that brains are based on many different kinds of interacting mechanisms. Several kinds of evidence impel us toward this view. One is the great variety of different and specific functions embodied in the brain’s biology. Another is the similarly great variety of phenomena in the psychology of intelligence. And from a much more abstract viewpoint, we cannot help but be impressed with the practical limitations of each “general” scheme that has been proposed — and with the theoretical opacity of questions about how they behave when we try to scale their applications past the toy problems for which they were first conceived.
Our research on perceptrons and on other computational schemes has left us with a pervasive bias against seeking a general, domain-independent theory of “how neural networks work.” Instead, we ought to look for ways in which particular types of network models can support the development of models of particular domains of mental function — and vice versa. Thus, our understanding of the perceptron’s ability to perform geometric tasks was actually based on theories that were more concerned with geometry than with networks. And this example is supported by a broad body of experience in other areas of artificial intelligence. Perhaps this is why the current preoccupation of connectionist theorists with the search for general learning algorithms evokes for us two aspects of the early history of computation. First, we are reminded of the long line of theoretical work that culminated in the “pessimistic” theories of Gödel and Turing about the limitations on effective computability. Yet the realization that there can be no general-purpose decision procedure for mathematics had not the slightest dampening effect on research in mathematics or in computer science. On the contrary, awareness of those limiting discoveries helped motivate the growth of rich cultures involved with classifying and understanding more specialized algorithmic methods. In other words, it was the realization that seeking overgeneral solution methods would be as fruitless as — and equivalent to — trying to solve the unsolvable halting problem for Turing machines. Abandoning this then led to seeking progress in more productive directions. Our second thought is about how the early research in artificial intelligence tended to focus on general-purpose algorithms for reasoning and problem solving.
Those general methods will always play their roles, but the most successful applications of AI research gained much of their practical power from applying specific knowledge to specific domains. Perhaps that work has now moved too far toward ignoring general theoretical considerations, but by now we have learned to be skeptical about the practical power of unrestrained generality.

Interaction and Insulation

Evolution seems to have anticipated these discoveries. Although the nervous system appears to be a network, it is very far from being a single, uniform, highly interconnected assembly of units that each have similar relationships to the others. Nor are all brain cells similarly affected by the same processes. It would be better to think of the brain not as a single network whose elements operate in accord with a uniform set of principles but as a network whose components are themselves networks having a large variety of different architectures and control systems. This “society of mind” idea has led our research perspective away from the search for algorithms, such as GD, that were hoped to work across many domains. Instead, we were led into trying to understand what specific kinds of processing would serve specific domains. We recognize that the idea of distributed, cooperative processing has a powerful appeal to common sense as well as to computational and biological science. Our research instincts tell us to discover as much as we can about distributed processes. But there is another concept, complementary to distribution, that is no less strongly supported by the same sources of intuition. We’ll call it insulation. Certain parallel computations are by their nature synergistic and cooperative: each part makes the others easier. But the And/Or of theorem 4.0 shows that under other circumstances, attempting to make the same network perform two simple tasks at the same time leads to a task that has a far greater order of difficulty.
In those sorts of circumstances, there will be a clear advantage to having mechanisms, not to connect things together, but to keep such tasks apart. How can this be done in a connectionist net? Some recent work hints that even simple multilayer perceptron-like nets can learn to segregate themselves into quasi-separate components — and that suggests (at least in principle) research on uniform learning procedures. But it also raises the question of how to relate those almost separate parts. In fact, research on networks in which different parts do different things and learn those things in different ways has become our principal concern. And that leads us to ask how such systems could develop managers for deciding, in different circumstances, which of those diverse procedures to use. For example, consider all the specialized agencies that the human brain employs to deal with the visual perception of spatial scenes. Although we still know little about how all those different agencies work, the end result is surely even more complex than what we described in section 13.4. Beyond that, human scene analysis also engages our memories and goals. Furthermore, in addition to all the systems we humans use to dissect two-dimensional scenes into objects and relationships, we also possess machinery for exploiting stereoscopic vision. Indeed, there appear to be many such agencies — distinct ones that employ, for example, motion cues, disparities, central correlations of the Julesz type, and memory-based frame-array-like systems that enable us to imagine and virtually “see” the occluded sides of familiar objects. Beyond those, we seem also to have been supplied with many other visual agencies — for example, ones that are destined to learn to recognize faces and expressions, visual cliffs, threatening movements, sexual attractants, and who knows how many others that have not been discovered yet.
What mechanisms manage and control the use of all those diverse agencies? And from where do those managers come?

Stages of Development

In Mindstorms and in The Society of Mind, we explained how the idea of intermediate, hidden processes might well account for some phenomena discovered by Piaget in his experiments on how children develop their concepts about the “conservation of quantity.” We introduced a theory of mental growth based on inserting, at various times, new inner layers of “management” into already existing networks. In particular, we argued that, to learn to make certain types of comparisons, a child’s mind must construct a multilayer structure that we call a “society-of-more.” The lower levels of that net contain agents specialized to make a variety of spatial and temporal observations. Then the higher-level agents learn to classify, and then control, the activities of the lower ones. We certainly would like to see a demonstration of a learning process that could spontaneously produce the several levels of agents needed to embody a concept as complex as that. Chapter 17 of The Society of Mind offers several different reasons why this might be very difficult to do except in systems under systematic controls, both temporal and architectural. We suspect that it would require far too long, in comparison with an infant’s months of life, to create sophisticated agencies entirely by undirected, spontaneous learning. Each specialized network must begin with promising ingredients that come either from prior stages of development or from some structural endowment that emerged in the course of organic evolution. When should new layers of control be introduced? If managers are empowered too soon, when their workers still are too immature, they won’t be able to accomplish enough. (If every agent could learn from birth, they would all be overwhelmed by infantile ideas.) But if the managers arrive too late, that will retard all further growth.
Ideally, every agency’s development would be controlled by yet another agency equipped to introduce new agents just when they are needed — that is, when enough has been learned to justify the start of another stage. However, that would require a good deal of expertise on the controlling agency’s part. Another way — much easier to evolve — would simply enable various agencies to establish new connections at genetically predetermined times (perhaps while also causing lower-level parts to slow further growth). Such a scheme could benefit a human population on the whole, although it might handicap individuals who, for one reason or another, happen to move ahead of or behind that inborn “schedule.” In any case, there are many reasons to suspect that the parts of any system as complex as a human mind must grow through sequences of stage-like episodes.

Architecture and Specialization

The tradition of connectionism has always tried to establish two claims: that connectionist networks can accomplish interesting tasks and that they can learn to do so with no explicit programming. But a closer look reveals that rarely are those two virtues present in the same device. It is true that networks, taken as a class, can do virtually anything. However, each particular type of network can best learn only certain types of things. Each particular network we have seen seems relatively limited. Yet our wondrous brains are themselves composed of connected networks of cells. We think that the difference in abilities comes from the fact that a brain is not a single, uniformly structured network. Instead, each brain contains hundreds of different types of machines, interconnected in specific ways which predestine that brain to become a large, diverse society of partially specialized agencies.
We are born with specific parts of our brains to serve every sense and muscle group, and with perhaps separate sections for physical and social matters (e.g., natural sounds versus social speech, inanimate scenes versus facial expressions, mechanical contacts versus social caresses). Our brains also embody proto-specialists involved with hunger, laughter, anger, fear, and perhaps hundreds of other functions that scientists have not yet isolated. Many thousands of genes must be involved in constructing specific internal architectures for each of those highly evolved brain centers and in laying out the nerve bundles that interconnect them. And although each such system is embodied in the form of a network-based learning system, each almost surely also learns in accord with somewhat different principles. Why did our brains evolve so as to contain so many specialized parts? Could not a single, uniform network learn to structure itself into divisions with appropriate architectures and processes? We think that this would be impractical because of the problem of representing knowledge. In order for a machine to learn to recognize or perform X, be it a pattern or a process, that machine must in one sense or another learn to represent or embody X. Doing that efficiently must exploit some happy triadic relationship between the structure of X, the learning procedure, and the initial architecture of the network. It makes no sense to seek the “best” network architecture or learning procedure because it makes no sense to say that any network is efficient by itself: that makes sense only in the context of some class of problems to be solved. Different kinds of networks lend themselves best to different kinds of representations and to different sorts of generalizations.
This means that the study of networks in general must include attempts, like those in this book, to classify problems and learning processes; but it must also include attempts to classify the network architectures. This is why we maintain that the scientific future of connectionism is tied not to the search for some single, universal scheme to solve all problems at once but to the evolution of a many-faceted technology of “brain design” that encompasses good technical theories about the analysis of learning procedures, of useful architectures, and of organizational principles to use when assembling those components into larger systems.

Symbolic versus Distributed

Let us now return to the conflict posed in our prologue: the war between the connectionists and the symbolists. We hope to make peace by exploiting both sides.

There are important virtues in the use of parallel distributed networks. They certainly often offer advantages in simplicity and in speed. And above all else they offer us ways to learn new skills without the pain and suffering that might come from comprehending how. On the darker side, they can limit large-scale growth because what any distributed network learns is likely to be quite opaque to other networks connected to it.

Symbolic systems yield gains of their own, in versatility and unlimited growth. Above all else they offer us the prospect that computers share: of not being bound by the small details of the parts of which they are composed. But that, too, has its darker side: symbolic processes can evolve worlds of their own, utterly divorced from their origins. Perceptrons can never go insane, but the same cannot be said of a brain.

Now, what are symbols, anyway? We usually conceive of them as compact things that represent more complex things. But what, then, do we mean by represent?
It simply makes no sense, by itself, to say that “S represents T,” because the significance of a symbol depends on at least three participants: on S, on T, and on the context of some process or user U. What, for example, connects the word table to any actual, physical table? Since the words people use are the words people learn, clearly the answer must be that there is no direct relationship between S and T, but that there is a more complex triadic relationship that connects a symbol, a thing, and a process that is active in some person’s mind. Furthermore, when the term symbol is used in the context of network psychology, it usually refers to something that is reassignable so that it can be made to represent different things and so that the symbol-using processes can learn to deal with different symbols.

What do we mean by distributed? This usually refers to a system in which each end-effect comes not from any single, localized element-part, but from the interactions of many contributors, all working at the same time. Accordingly, in order to make a desired change in the output of a distributed system, one must usually alter a great many components. And changing the output of any particular component will rarely have a large effect in any particular circumstance; instead, such changes will tend to have small effects in many different circumstances.

Symbols are tokens or handles with which one specialist can manipulate representations within another specialist. But now, suppose that we want one agency to be able to exploit the knowledge in another agency. So long as we stay inside a particular agency, it may be feasible to use representations that involve great hosts of internal interactions and dependencies. But the fine details of such a representation would be meaningless to any outside agency that lacks access to, or the capacity to deal with, all that fine detail.
Indeed, if each representation in the first agency involves activities that are uniformly distributed over a very large network, then direct communication to the other agency would require so many connection paths that both agencies would end up enmeshed together into a single, uniform net, and then all the units of both would interact.

How, then, could networks support symbolic forms of activities? We conjecture that, inside the brain, agencies with different jobs are usually constrained to communicate with one another only through neurological bottlenecks (i.e., connections between relatively small numbers of units that are specialized to serve as symbolic recognizers and memorizers). The recognizers learn to encode significant features of the representation active in the first network, and the memorizers learn to evoke an activity that can serve a corresponding function in the receiving network. But in order to prevent those features from interfering too much with one another, there must be an adequate degree of insulation between the units that serve these purposes. And that need for insulation can lead to genuine conflicts between the use of symbolic and distributed representations. This is because distributed representations make it hard to combine (in arbitrary, learnable ways) the different fragments of knowledge embodied in different representations. The difficulty arises because the more distributed is the representation of each fragment, the fewer fragments can be simultaneously active without interfering with one another. Sometimes those interactions can be useful, but in general they will be destructive. This is discussed briefly in section 8.2 of The Society of Mind:

“The advantages of distributed systems are not alternatives to the advantages of insulated systems: the two are complementary.
To say that the brain may be composed of distributed systems is not the same as saying that it is a distributed system, that is, a single network in which all functions are uniformly distributed. We do not believe that any brain of that sort could work, because the interactions would be uncontrollable. To be sure, we have to explain how different ideas can become connected to one another, but we must also explain what keeps our separate memories intact. For example, we praised the power of metaphors that allow us to mix the ideas we have in different realms, but all that power would be lost if all our metaphors got mixed! Similarly, the architecture of a mind-society must encourage the formation and maintenance of distinct levels of management by preventing the formation of connections between agencies whose messages have no mutual significance. Some theorists have assumed that distributed systems are inherently both robust and versatile but, actually, those attributes are more likely to conflict. Systems with too many interactions of different types will tend to be fragile, while systems with too many interactions of similar types will tend to be too redundant to adapt to novel situations and requirements.”

A larger-scale problem is that the use of widely distributed representations will tend to oppose the formulation of knowledge about knowledge. This is because information embodied in distributed form will tend to be relatively inaccessible for use as a subject upon which other knowledge-based processes can operate. Consequently (we conjecture), systems that use highly distributed representations will tend to become conceptual dead ends as a result of their putting performance so far ahead of comprehension as to retard the growth of reflective thought. Too much diffusing of information can make it virtually impossible (for other portions of the brain) to find out how results, however useful, are obtained.
This would make it very difficult to dissect out the components that might otherwise be used to construct meaningful variations and generalizations. Of course such problems won’t become evident in experiments with systems that do only simple things, but we can expect to see such problems grow when systems try to learn to do more complex things. With highly distributed systems, we should anticipate that the accumulation of internal interactions may eventually lead to intractable credit-assignment problems. Perhaps the only ultimate escape from the limitations of internal interactions is to evolve toward organizations in which each network affects others primarily through the use of serial operations and specialized short-term-memory systems, for although seriality is relatively slow, its use makes it possible to produce and control interactions between activities that occur at different and separate places and times.

The Parallel Paradox

It is often argued that the use of distributed representations enables a system to exploit the advantages of parallel processing. But what are the advantages of parallel processing? Suppose that a certain task involves two unrelated parts. To deal with both concurrently, we would have to maintain their representations in two decoupled agencies, both active at the same time. Then, should either of those agencies become involved with two or more subtasks, we would have to deal with each of them with no more than a quarter of the available resources. If that proceeded on and on, the system would become so fragmented that each job would end up with virtually no resources assigned to it. In this regard, distribution may oppose parallelism: the more distributed a system is (that is, the more intimately its parts interact) the fewer different things it can do at the same time.
On the other side, the more we do separately in parallel, the less machinery can be assigned to each element of what we do, and that ultimately leads to increasing fragmentation and incompetence.

This is not to say that distributed representations and parallel processing are always incompatible. When we simultaneously activate two distributed representations in the same network, they will be forced to interact. In favorable circumstances, those interactions can lead to useful parallel computations, such as the satisfaction of simultaneous constraints. But that will not happen in general; it will occur only when the representations happen to mesh in suitably fortunate ways. Such problems will be especially serious when we try to train distributed systems to deal with problems that require any sort of structural analysis in which the system must represent relationships between substructures of related types, that is, problems that are likely to compete for the same limited resources.

On the positive side, there are potential virtues to embodying knowledge in the form of networks of units with weighted interconnections. For example, distributed representations can sometimes be used to gain the robustness of redundancy, to make machines that continue to work despite having injured, damaged, or unreliable components. They can embody extremely simple learning algorithms, which operate in parallel with great speed.

Representations and Generalizations

It is often said that distributed representations are inherently possessed of useful holistic qualities; for example, that they have innate tendencies to recognize wholes from partial cues, even for patterns they have not encountered before. Phenomena of that sort are often described with such words as generalization, induction, or gestalt. Such phenomena certainly can emerge from connectionist assemblies.
The problem is that, for any body of experience, there are always many kinds of generalizations that can be made. The ones made by any particular network are likely to be inappropriate unless there happens to be an appropriate relationship between the network’s architecture and the manner in which the problem is represented. What makes architectures and representations appropriate? One way to answer that is to study how they affect which signals will be treated as similar.

Consider the problem of comparing an arbitrary input pattern with a collection of patterns in memory, to find which memory is most similar to that stimulus. In section 12.7 we conjectured that solving best-match problems will always be very tedious when serial hardware is used. PDP suggests another view in regard to parallel, distributed machines: “This is precisely the kind of problem that is readily implemented using highly parallel algorithms of the kind we consider.” This is, in some ways, plausible, since a sufficiently parallel machine could simultaneously match an input pattern against every pattern in its memory. And yet the assertion is quaintly naive, since best match means different things in different circumstances. Which answers should be accepted as best always depends on the domain of application. The very same stimulus may signify food to one animal, companionship to another, and a dangerous predator to a third. Thus, there can be no single, universal measure of how well two descriptions match; every context requires appropriate schemes. Because of this, distributed networks do not magically provide solutions to such best-match problems. Instead, the functional architecture of each particular network imposes its own particular sort of metrical structure on the space of stimuli. Such structures may often be useful.
Yet, that can give us no assurance that the outcome will correspond to what an expert observer would consider to be the very best match, given that observer’s view of what would be the most appropriate response in the current context or problem realm.

We certainly do not mean to suggest that networks cannot perform useful matching functions. We merely mean to emphasize that different problems entail different matching criteria, and that hence no particular type of network can induce a topology of similarity or nearness that is appropriate for every realm. Instead, we must assume that, over the course of time, each specialized portion of the brain has evolved a particular type of architecture that is reasonably likely to induce similarity relationships that are useful in performing the functions to which that organ is likely (or destined) to be assigned. Perhaps an important activity of future connectionist research will be to develop networks that can learn to embody wide ranges of different, context-dependent types of matching functions.

We have also often heard the view that machines that employ localized or symbolic representations must be inherently less capable than are distributed machines of insight, consciousness, or sense of self. We think this stands things on their heads. It is because our brains primarily exploit connectionist schemes that we possess such small degrees of consciousness, in the sense that we have so little insight into the nature of our own conceptual machinery. We agree that distributed representations probably are used in virtually every part of the brain. Consequently, each agency must learn to exploit the abilities of the others without having direct access to compact representations of what happens inside those other agencies.
This makes direct insight infeasible; the best such agencies can do is attempt to construct their own models of the others on the basis of approximate, pragmatic models based on presuppositions and concepts already embodied in the observing agency. Because of this, what appear to us to be direct insights into ourselves must be rarely genuine and usually conjectural. Accordingly, we expect distributed representations to tend to produce systems with only limited abilities to reflect accurately on how they do what they do. Thinking about thinking, we maintain, requires the use of representations that are localized enough that they can be dissected and rearranged. Besides, distributed representations spread out the information that goes into them. The result of this is to mix and obscure the effects of their separate elements. Thus their use must entail a heavy price; surely, many of them must become “conceptual dead ends” because the performances that they produce emerge from processes that other agencies cannot comprehend. In other words, when the representations of concepts are distributed, this will tend to frustrate attempts of other agencies to adapt and transfer those concepts to other contexts.

How much, then, can we expect from connectionist systems? Much more than the above remarks might suggest, since reflective thought is the lesser part of what our minds do. Most probably, we think, the human brain is, in the main, composed of large numbers of relatively small distributed systems, arranged by embryology into a complex society that is controlled in part (but only in part) by serial, symbolic systems that are added later. But the subsymbolic systems that do most of the work from underneath must, by their very character, block all the other parts of the brain from knowing much about how they work. And this, itself, could help explain how people do so many things yet have such incomplete ideas of how those things are actually done.
Bibliographic Notes

The following remarks are intended to introduce the literature of this field. This is not to be considered an attempt at historical scholarship, for we have made no serious venture in that direction.

In a decade of work on the family of machines loosely called perceptrons, we find an interacting evolution and refinement of two ideas: first, the concept of realizing a predicate as a linear threshold function of much more local predicates; second, the idea of a convergence or “learning” theorem. The most commonly held version of this history sees the perceptron invented by Rosenblatt in a single act, with the final proof of the convergence theorem vindicating his insight in the face of skepticism from the scientific world. This is an oversimplification, especially in its taking the concept of perceptron as static. For in fact a key part of the process leading to the convergence theorem was the molding of the concept of the machine to the appropriate form. (Indeed, how often does “finding the proof” of a conjecture involve giving the conjecture a more provable form?) In the early papers one sees a variety, both of machines and of “training” procedures, converging in the course of accumulation of mathematical insight toward the simple concepts we have used in this book. Students interested in this evolution can read:

Rosenblatt, Frank (1959), “Two theorems of statistical separability in the perceptron,” Proceedings of a Symposium on the Mechanization of Thought Processes, Her Majesty’s Stationery Office, London, pp. 421-456;

Rosenblatt, Frank (1962), Principles of Neurodynamics, Spartan Books, New York.

In a variety of contexts, other perceptronlike learning experiments had been described. Quite well-known was the paper of

Samuel, Arthur L. (1959), “Some studies in machine learning using the game of checkers,” IBM Journal of Research and Development, Vol. 3, No. 3, pp.
210-223

who describes a variety of error-correcting vector addition procedures. In a later paper

Samuel, Arthur L. (1967), “Some studies in machine learning using the game of checkers, Part II,” IBM Journal of Research and Development, Vol. 11, No. 4, pp. 601-618

he describes studies that lead toward detecting more complex interactions between the partial predicates.

The simple multilayer perceptronlike machines discussed in Chapter 13 were described in

Palmieri, G. and R. Sanna (1960), Methodos, Vol. 12, No. 48;

Gamba, A., L. Gamberini, G. Palmieri, and R. Sanna (1961), “Further experiments with PAPA,” Nuovo Cimento Suppl. No. 2, Vol. 20, pp. 221-231.

Some earlier reward-modified machines, further from the final form of the perceptron, are described in

Ashby, W. Ross (1952), Design for a Brain, Wiley, New York;

Clark, Wesley A., and Farley, B. G. (1955), “Generalization of pattern-recognition in a self-organizing system,” Proceedings 1955 Western Joint Computer Conference, pp. 85-111;

Minsky, M. (1954), “Neural nets and the brain-model problem,” doctoral dissertation, Princeton University, Princeton, N.J.;

Uttley, A. M. (1956), “Conditional probability machines,” in Automata Studies, Princeton University, Princeton, N.J., pp. 253-285.

The proof of the convergence theorem (Theorem 11.1) is another example of this sort of evolution. In an abstract mathematical sense, both theorem and proof already existed before the perceptron, for several people had considered the idea of solving a set of linear inequalities by “relaxation” methods: successive adjustments much like those used in the perceptron procedure. An elegant early paper on this is

Agmon, S. (1954), “The relaxation method for linear inequalities,” Canadian Journal of Mathematics, Vol. 6, No. 3, pp. 382-392.
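A relaxation method of this general kind can be sketched in a few lines. The following is only an illustrative sketch under stated assumptions, not a transcription of Agmon's paper or of any procedure in this book: the function name `relax`, the target `margin`, the multiplier `lam`, the tolerance `tol`, the step limit, and the sample vectors are all hypothetical choices made for the example. At each step the sketch finds the inequality with the largest shortfall and corrects the weight vector by a multiple of the corresponding vector.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def relax(phis, margin=1.0, lam=1.0, tol=1e-9, max_steps=1000):
    """Seek w with dot(w, phi) >= margin - tol for every phi in phis.

    Hypothetical illustration: all names and defaults are assumptions.
    Returns the weight vector, or None if max_steps is exhausted.
    """
    w = [0.0] * len(phis[0])
    for _ in range(max_steps):
        # Error of each inequality: how far dot(w, phi) falls short of margin.
        errors = [margin - dot(w, phi) for phi in phis]
        worst = max(range(len(phis)), key=lambda i: errors[i])
        if errors[worst] <= tol:
            return w  # every inequality holds to within tol
        phi = phis[worst]
        # Correct w by a multiple of the worst vector; with lam = 1 this
        # moves w exactly onto the hyperplane dot(w, phi) = margin.
        step = lam * errors[worst] / dot(phi, phi)
        w = [wi + step * pi for wi, pi in zip(w, phi)]
    return None  # no solution found within max_steps


phis = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
w = relax(phis)
print(w)  # [1.0, 1.0]
```

On the three sample vectors the sketch stops after two corrections. For an inconsistent system such as `[(1.0, 0.0), (-1.0, 0.0)]` the corrections oscillate, and the sketch gives up and returns `None` once `max_steps` is reached.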
In Agmon’s procedure, one computes the Φ-vector that gives the largest numerical error in the satisfaction of the linear inequality, and uses a multiple of that vector for correction. (See §11.4.) We do not feel sufficiently scholarly to offer an opinion on whether this paper should deserve priority for the discovery of the convergence theorem. It is possible that the theorem would have been instantly obvious had the cyberneticists interested in perceptrons known about Agmon’s work. In any case, the first proofs of the convergence theorem offered in cybernetic circles were quite independent of the work on linear inequalities. See, for example

Block, H. D. (1962), “The perceptron: a model for brain functioning,” Reviews of Modern Physics, Vol. 34, No. 1, pp. 123-135.

This proof was quite complicated. The first use known to us of the simpler kind of analysis used in §11.1 is in

Papert, Seymour (1961), “Some Mathematical Models of Learning,” Proceedings of the Fourth London Symposium on Information Theory, C. Cherry, Editor, Academic Press, New York.

Curiously, this paper is not mentioned by any later commentators (including the usually scholarly Nilsson) other than Rosenblatt in Neurodynamics. The convergence theorem is well discussed, in a variety of settings, by

Nilsson, Nils (1965), Learning Machines, McGraw-Hill, New York,

who includes a number of historical notes. Readers who consult the London Symposium volume might also read

Minsky, Marvin, and Oliver G. Selfridge (1961), “Learning in neural nets,” Proceedings of the Fourth London Symposium on Information Theory, C. Cherry, Editor, Academic Press, New York,

for some discussion of the relations between convergence and hill-climbing. Although Minsky and Papert did not yet know one another, their papers in that volume overlap to the extent of proving the same theorem about the Bayesian optimality of linear separation.
This coincidence had no obvious connection with their later collaboration.

As Agmon had clearly anticipated the learning aspect of the perceptron, so had Selfridge anticipated its quality of combining local properties to yield apparently global ones. This is seen, for example, in

Selfridge, Oliver G. (1956), “Pattern recognition and learning,” Proceedings of the Third London Symposium on Information Theory, Academic Press, New York, p. 345.

Incidentally, we consider that there has been a strong influence of these cybernetic ideas on the trend of ideas and discoveries in the physiology of vision represented, for example, in

Lettvin, Jerome Y., H. Maturana, W. S. McCulloch, and W. Pitts (1959), “What the frog’s eye tells the frog’s brain,” Proceedings of the IRE, Vol. 47, pp. 1940-1951

and

Hubel, D. H., and T. N. Wiesel (1959), “Receptive fields of single neurons in the cat’s striate cortex,” Journal of Physiology, Vol. 148, pp. 574-591.

Other ideas used in this book come from earlier models of physiological phenomena, notably the paper of

Pitts, W., and W. S. McCulloch (1947), “How we know universals,” Bulletin of Mathematical Biophysics, Vol. 9, pp. 127-147

which is the first we know of that treats recognition invariant under a group by integrating or summing predicates over the group. This paper and that of Lettvin et al. are reprinted in

McCulloch, Warren S. (1965), Embodiments of Mind, The M.I.T. Press, Cambridge, Mass.,

and this book reprints other early attempts to pass from the local to the global with networks of individually simple devices. In a third paper reprinted in Embodiments of Mind

McCulloch, W. S., and Walter Pitts (1943), “A logical calculus of the ideas immanent in neural nets,” Bulletin of Mathematical Biophysics, Vol. 5, pp. 115-137

will be found the prototypes of the linear threshold functions themselves.
Readers who are unfamiliar with this theory, or that of Turing machines, are directed to the elementary exposition in

Minsky, Marvin (1967), Computation: Finite and Infinite Machines, Prentice-Hall, Englewood Cliffs, N.J.

The local-global transition has dominated several biological areas in recent years. A most striking example is the trend in analysis of animal behavior associated with the name of Tinbergen, as in his classic

Tinbergen, N. (1951), The Study of Instinct, Oxford, New York.

Returning to the technical aspects of perceptrons, we find that our main subject is not represented at all in the literature. We know of no papers that either prove that a nontrivial perceptron cannot accomplish a given task or else show by mathematical analysis that a perceptron can be made to compute any significant geometric predicate. There is a vast literature about experimental results but generally these are so inconclusive that we will refrain from citing particular papers. In most cases that seem to show “success,” it can be seen that the data permits an order-1 separation, or even a conjunctively local separation! In these cases, the authors do not mention this, though it seems inconceivable that they could not have noticed it!

The approach closest to ours, though still quite distant, is that of

Bledsoe, W. W., and I. Browning (1959), “Pattern recognition and reading by machine,” Proceedings of the Eastern Joint Computer Conference, 1959, pp. 225-232.

Another early paper that recognizes the curiously neglected fact that partial predicates work better when realistically matched to the problem, is

Roberts, Lawrence G. (1960), “Pattern recognition with an adaptive network,” IRE International Convention Record, Part II, pp. 66-70.

Rosenblatt made some studies (in Neurodynamics) concerning the probability that if a perceptron recognizes a certain class of figures it will also recognize other figures that are similar in certain ways.
In another paper

Rosenblatt, Frank (1960), “Perceptual generalization over transformation groups,” Self-Organizing Systems, Pergamon Press, New York, pp. 63-96,

he considers group-invariant patterns but does not come close enough to the group-invariance theorem to get decisive results.

The nearest approach to our negative results and methods is the analysis of ψ_PARITY found in

Dertouzos, Michael (1965), Threshold Logic: A Synthesis Approach, The M.I.T. Press, Cambridge, Mass.

This is also a good book in which to see how people who are not interested in geometric aspects of perceptrons deal with linear threshold functions. They had already been interested, for other reasons, in the size of coefficients of (first-order) threshold functions, and we made use of an idea described in

Myhill, John and W. H. Kautz (1961), “On the size of weights required for linear-input switching functions,” IRE Transactions on Electronic Computers, Vol. 10, No. 2, pp. 288-290

to get our theorem in §10.1. A more recent result on order-1 coefficients is in

Muroga, Saburo, and I. Toda (1966), “Lower bounds on the number of threshold functions,” IEEE Transactions on Electronic Computers, Vol. EC-15, No. 5, pp. 805-806,

which improves upon an earlier result in

Muroga, Saburo (1965), “Lower bounds on the number of threshold functions and a maximum weight,” IEEE Transactions on Electronic Computers, Vol. EC-14, No. 2, pp. 136-148.

These papers also discuss another question: the proportion of Boolean functions (of n variables) that happens to be first-order. To our knowledge, there is no literature about the same question for higher-order functions.

The general area of artificial intelligence and heuristic programming was mentioned briefly in Chapter 13 as the direction we feel one should look for advanced ideas about pattern recognition and learning.
No systematic treatment is available of what is known in this area, but we can recommend a few general references. The collection of papers in

Feigenbaum, Edward A., and Julian Feldman (1963), Computers and Thought, McGraw-Hill, New York

shows the state of affairs in the area up to about 1962, while

Minsky, Marvin (1968), Semantic Information Processing, The M.I.T. Press, Cambridge, Mass.

contains more recent papers (mainly doctoral dissertations) dealing with computer programs that manipulate verbal and symbolic descriptions. Anyone interested in this area should also know the classic paper

Newell, Allen, J. C. Shaw, and H. A. Simon (1959), “Report on a general problem-solving program,” Proceedings of International Conference on Information Processing, UNESCO House, pp. 256-264.

The program mentioned in Chapter 13 is described in detail in

Guzman, Adolfo (1968), “Decomposition of a visual scene into bodies,” Proceedings Fall Joint Computer Conference, 1968.

Finally, we mention two early works that had a rather broad influence on cybernetic thinking. The fearfully simple homeostat concept mentioned in §11.6 is described in

Ashby, W. Ross (1952), Design for a Brain, Wiley, New York

which discussed only very simple machines, to be sure, but for the first time with relentless clarity. At the other extreme, perhaps, was

Hebb, Donald O. (1949), The Organization of Behavior, Wiley, New York

which sketched a hierarchy of concepts proposed to account for global states in terms of highly local neuronal events. Although the details of such an enterprise have never been thoroughly worked out, Hebb’s outline was for many workers a landmark in the shift from a search for a single, simple principle of brain organization toward more realistic attempts to construct hierarchies (or rather heterarchies, as McCulloch would insist) that could support the variety of computations evidently needed for thinking.
You might like to compare your reactions with those of other readers. There are serious discussions of the book in:

Block, Herbert, A Review of “Perceptrons,” Information and Control, vol. 17, 1970, pp. 501-522.

Newell, Allen, “A step toward the understanding of information processes,” Science, vol. 165, 22 Aug. 1969, pp. 780-782.

Mycielski, Jan, Review of “Perceptrons,” Bull. Amer. Math. Soc., vol. 78, Jan. 1972, pp. 12-15.

Minsky, M., Review of Perceptrons, A.I. Memo [number illegible], Artificial Intelligence [Laboratory], Cambridge, Mass. [date illegible].

The Block review also contains an extensive bibliography.

Index

A
A*, a solution or separation vector, 164, 167
A_file, 189
A_find, 189
Adaptive, 16, 19
Adjacent squares, 74, 83, 87
Agmon, S., 175. See also Bibliographic Notes
Algebraic geometry, 66
And/Or theorem, 36, 62, 86, 110, 228, 240
Area, 54, 55, 62, 99, 130, 132
Armstrong, William, 245
ARPA, 246
Artificial intelligence, 232, 242. See also Heuristic programming
Ashby, W. Ross, 81. See also Bibliographic Notes
Associative memory, 2. See also Hash coding

B
Ball, Geoffrey, 211
BAYES, 193, 195-205
Beard, R., 245
Best match, 222, 224
BEST PLANE, 194
Beyer, Terry, 146-148, 183, 244-245
Bezout’s theorem, 66
Bionics, 242
Bledsoe, Woodrow W., 239, 245. See also Bibliographic Notes
Block, H. D. See Bibliographic Notes; 197
Blum, Manuel, 140, 245
Browning, Iben, 239. See also Bibliographic Notes

C
C_j, a stratification class, 115
Center of gravity, 55
Circle, 106
Clark, W. A. See Bibliographic Notes
Closed under a group, 47, 241
Cluster, 191, 211, 233
Coefficients, 27, 70, 97, 126, 159, 243
  size of, 15, 17, 18, 117, 151-160
Collapsing theorem, 77-80
Compact, 64, 160, 186. Infinite sequences of points from a compact set always have limit points in the set. Spheres and closed line segments are compact. The concept is discussed in all except the most modern books on topology or real analysis.
COMPLETE STORAGE, 189
Component (connected), 87
Computation time, 136, 143-150, 216
Computer, 227, 231.
See also Program
Conjunctively local, 8, 9, 11, 103, 105, 129, 142
Conjunctive normal form, 79
Connectedness, 6, 8, 12, 13, 69-95, 136-150, 232, 238, 241
Context, 98, 111-113
Convergence theorem for perceptrons, 15, 15, 164-187, 243
Convexity, 6, 7, 26, 35, 103, 133, 141
Cost of errors, 205
Cover, Tom, 214
Criticisms, 4, 15, 16, 165, 180, 189, 243
Curvature, 104, 133
Cycling theorem, 182

D
D (diameter), 129
Data set, 188, 215
Description, 233
Dertouzos, M. See Bibliographic Notes
Diameter-limited, 9, 12, 73, 103, 104, 131-135
Dilation group, 124
Distance, 191-192, 222, 225

E
Efron, Bradley, 183, 244
Equivalence
  of figures, 46, 114, 124
  of predicates, 47
Estimators, 206. See also Probability
Equivalence-class, 46
Error correction, 163
Euler number, 69, 86, 133, 134-135, 241
Exact match. See Match

F
F, 25
F+, 166
F-, 166
FALSE, 26
Fano, Robert M., 245
Farley, B. G. See Bibliographic Notes
Feedback, 3, 162
Feigenbaum, E. A., 232. See also Bibliographic Notes
Feldman, Julian, 232. See also Bibliographic Notes
Fell, Harriet, 245
Filter, 228
Finite order. See Order
Finite state, 15, 140
Forgetting, 207, 215

G
[symbol illegible], 43
gX, 42
G(X), 86
Gamba, A., 228. See also Bibliographic Notes
Gamba perceptrons, 12, 228-231
Gamberini, L. See Bibliographic Notes
Geometric (property), 99, 243-244
G-equivalence, 47
Gestalt, 20
Global, 2, 17, 19, 242. See also Local
Gödel number, 70
Group, of transformations, 22, 39, 41, 44, 96, 126
Group-invariance theorem, 22, 48, 96, 100, 102, 239-241
Guzman, Adolfo, 233, 255. See also Bibliographic Notes

H
hG, 44
Haar measure, 55
Hall, David, 211
Hash coding, 190, 219-221
Hebb, Donald O., 19. See also Bibliographic Notes
Henneman, William, 245
Heuristic programming, 232, 233, 239
Hewitt, Carl, 140, 245
Hill-climbing, 163, 178, 244
Hole (in component), 87
Homeostat, 180, 244
Hubel, D. H. See Bibliographic Notes
Huffman, David, 79, 113, 241
Hyperplane, 14, 195, 240

I
I*, 137
I(X), 31. The constant (= 1) identity function.
Incremental computation, 215, 225
Independence (statistical), 200
Infinite groups, 44, 48, 97, 99, 114
Infinite sets, 11, 27, 37, 97, 114, 158
Integral, 54, 70, 133
Invariant
  of group, 41
  topological, 86, 92-95. Definition: Intuitively, any predicate that is unchanged when a figure is deformed without altering its connectedness properties or the inside-outside relationships among its components.
Irreducible algebraic curve, 66
ISODATA, 194, 211
Iterative arrays, 146

K
k, K, used for the order of a perceptron or the degree of a polynomial
Kautz, W. H., 160. See also Bibliographic Notes

L
L(Φ), 14, 28
Learning, 14, 15, 16, 18, 127, 149-150, 161-226, 243-244
Lettvin, J. Y. See Bibliographic Notes
Likelihood ratio, 209
Licklider, J. C. R., 246
Linear threshold function, 27, 31
Local, 2, 7, 10, 17, 73, 163, 235. See also Global
Logarithmic sort, 217
Loop, 3, 145, 231
Lyons, L., 245

M
McCulloch, Warren S., 55, 79, 239, 240. See also Bibliographic Notes
Marill, Thomas, 240
Maturana, H. See Bibliographic Notes
Mask, 22, 31, 35, 153, 155, 240
Match
  best, 222-226. See also Nearest neighbor
  exact, 215, 221. See also Hash coding
Maximum likelihood, 199, 202. See also BAYES
Measure, 55, 228
Memory, 136, 141, 145, 149, 215, 216, 243, 249. See also Learning
Metric, 71
Minsky, Marvin, 232. See also Bibliographic Notes
Moment, 55, 99
Multilayer, 228-232
Muroga, S. See Bibliographic Notes
Myhill, John, 160. See also Bibliographic Notes

N
N[subscript illegible](X), 60, 107
NEAREST NEIGHBOR, 150, 194-199. See also Match, best
Net, 206
Neuron, 14, 19, 210, 234
Newell, Allen, 234. See also Bibliographic Notes
Nilsson, Nils, 183, 244. See also Bibliographic Notes
Nonseparable, 181
Normalization, 116, 126-127

O
“One-in-a-box” theorem, 59-61, 69, 75, 112
Order, 12, 30, 35-56, 62, 78, 239-241
Order-restricted, 12

P
Palmieri, G. See Bibliographic Notes
Papert, S. See Bibliographic Notes
Parallel, 2, 3, 5, 17, 142-150, 241, 249.
See also Serial
Parity, 56-59, 83, 149, 151-158, 176, 230, 240, 241
Paterson, Michael, 92, 242, 245
Pattern recognition, 116, 227, 242, 244
Perception, human, 73, 238
Perceptron, 12. See also L(Φ)
Permutation group, 40, 46, 56
Perspective, 235. See also Three-dimensional predicates
Phenomenon, mathematical, 228
Physiology, 14
Pitts, Walter, 55, 99, 240. See also Bibliographic Notes
Polynomial of a predicate, 23, 41, 57, 60, 63
Positive normal form, 33, 34, 240
Predicate, 6, 25, 26
Predicate-scheme, 25, 37
Preprocessing, 113
Probability, 14-15, 165, 193-209, 239
Programs, 9, 14, 136-139, 164-167, 226, 232-238
Pushdown list, 71

Q
Query word, 188

R
R, Retina, 5, 25, 26
|R|, the number of points in the retina
Rectangle, 104, 130, 134
Reflection symmetry, 117
Reinforcement, 161, 206, 215
Repeated stratification, 121
Resolution. See Tolerance
Restrictions (on perceptrons), 5, 9, 12, 231
Roberts, L. G. See Bibliographic Notes
Rosenblatt, Frank, 19, 239. See also Bibliographic Notes
Rotation group, 43, 98, 120, 127

S
Samuel, Arthur, 16, 209. See also Bibliographic Notes
Sanna, R. See Bibliographic Notes
Scene-analysis, 102, 232-239
Self-organizing, 16, 19, 234
Selfridge, Oliver G., 179, 245. See also Bibliographic Notes
Serial, 17, 96, 127, 136-140, 232, 241-243
Shaw, J. C., 234. See also Bibliographic Notes
Similarity group, 98
Simon, Herbert A., 234. See also Bibliographic Notes
Solomonoff, Ray, 234
Solution vector, 165
Spectrum, 70, 77, 107, 110, 134
Square
  geometric, 105, 122, 243
  unit of retina, 44, 71
Statistics. See Probability
Stratification, 114-128, 156, 227, 292
Strauss, Dona, 76, 244, 245
Support, 27
Sussman, Gerald, 245
Switch, 81, 241
Symmetry, 117, 242

T
Template, 131, 230
Theorem, 226, 239
Three-dimensional predicates, 85, 148, 232
Threshold, 10
Threshold logic, 58. Name for theory of linear threshold functions.
Tinbergen, N. See Bibliographic Notes
Toda, I. See Bibliographic Notes
Tolerance, 71, 72, 124, 134, 142.
Allowable measurement errors.
Topology, 9, 69, 71, 85-95, 134-135, 242
Toroidal, 98, 126. See also Torus
Torus, 17, 44, 80, 85, 98, 126
Transitive group, 53-55
Translation group, 41, 46, 96, 98, 99-101, 105-106, 114, 118, 120, 124, 159
Translation-invariant, 56, 239
Translation spectra, 105. See also Spectrum
Triangle, 130
TRUE, 26
Turing machine, 139, 142
Twins, 114, 127, 242

U
Uttley, A. M. See Bibliographic Notes
Unbounded, 127. See also Infinite

V
Vector, 188, 240
Vector geometry, 165

W
Weight, 10
Weight of evidence, 204, 238
Wiesel, T. N. See Bibliographic Notes
White, John, 243

X
x, 25. A point of the retina R.
x ∈ X, 27
X, 5, 26. A picture, that is, a subset of R.

α_φ, 10, 49. Notation for coefficient of φ(X).
α(φ), alternative notation for α_φ.
ψ, 5, 26-32
ψ_CIRCLE, [page illegible]
ψ_CONNECTED, 6, 8, 12, 13
ψ_CONVEX, [page illegible]
ψ_PARITY, [page illegible]
ψ_SYMMETRICAL, [page illegible]
φ, 26. A partial predicate.
φ_i, 5, 10
φ_A(X), 7. The predicate [A ⊂ X].
Φ, 5, 8, 10. A family of predicates.
Φ_i, 53. An equivalence class of predicates.
Φ, a vector from a family F
Φ̂, unit vector, 166
π_j, 115
≡, 27. The Boolean equivalence predicate.
[ ], 26. Brackets for predicates.

Perceptrons, Expanded Edition
by Marvin L. Minsky and Seymour A. Papert

Perceptrons, the first systematic study of parallelism in computation, has remained a classical work on threshold automata networks for nearly two decades. It marked a historical turn in artificial intelligence and is required reading for anyone who wants to understand the connectionist revival that is going on today. Artificial-intelligence research, which for a time concentrated on the programming of von Neumann computers, is swinging back to the idea that intelligence might emerge from the activity of networks of neuronlike entities.
Minsky and Papert’s book was the first example of a mathematical analysis carried far enough to show the exact limitations of a class of computing machines that could seriously be considered as models of the brain.

Now the new developments in mathematical tools, the recent interest of physicists in the theory of disordered matter, the new insights into and psychological models of how the brain works, and the evolution of fast computers that can simulate networks of automata have given Perceptrons new importance.

Minsky and Papert have added a chapter to their seminal study in which they discuss the current state of parallel computers, review developments since the appearance of the 1972 edition, and identify new research directions related to connectionism. The central theoretical challenge facing connectionism, they observe, is in reaching a deeper understanding of how "objects" or "agents" with individuality can emerge in a network. Progress in this area would link connectionism with what the authors have called "society theories of mind."

Marvin L. Minsky is Donner Professor of Science in MIT's Electrical Engineering and Computer Science Department, and Seymour A. Papert is Professor of Media Technology at MIT.

The MIT Press, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142
MINPPR 0-262-63111-3